Executive Summary: Customer Segmentation Project¶

Most Important Findings: This project aims to leverage unsupervised learning (dimensionality reduction and clustering) to segment customers based on their characteristics and engagement with marketing campaigns. The analysis will identify key customer segments, potentially revealing distinct behavioral patterns and preferences. The impact of segmentation on marketing ROI will be a crucial outcome. Further detail on specific findings is pending completion of the analysis outlined in the notebook.

Proposed Model Specifications: Model specifications will be determined based on analysis results. Unsupervised learning, potentially using PCA for dimensionality reduction and K-means or hierarchical clustering, will be employed. Algorithm selection and hyperparameter tuning will be based on performance metrics (e.g., silhouette score, Davies-Bouldin index). Feature selection will be crucial.
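
A minimal sketch of this proposed specification, assuming a preprocessed numeric feature matrix X_df (an illustrative placeholder, not a variable from the notebook):

# Scale -> PCA -> K-means, scored with silhouette and Davies-Bouldin.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X_scaled = StandardScaler().fit_transform(X_df)                         # X_df: assumed numeric features
X_pca = PCA(n_components=0.9, random_state=42).fit_transform(X_scaled)  # keep ~90% of variance
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_pca)  # k=4 is illustrative
print("Silhouette:", silhouette_score(X_pca, labels))            # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X_pca, labels))    # lower is better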

Key Next Steps:

  1. Data Preparation: Further EDA, outlier detection, missing value imputation, and feature engineering aligned with the desired marketing metrics.
  2. Dimensionality Reduction: Apply PCA and find the optimal number of components (see the component-selection sketch after this list).
  3. Clustering: Use clustering algorithms to form customer segments; optimize cluster number and evaluate segmentation quality. Investigate the PCA stripes mentioned below in further detail.
  4. Segment Profiling: Analyze segment characteristics (demographics, purchase history, campaign engagement) and create segment profiles around marketing goals.
  5. Model Evaluation and Refinement: Evaluate model performance and refine algorithms and hyperparameters based on key marketing metrics.
  6. Actionable Insights: Develop recommendations for targeted marketing, personalized offers, and improved customer engagement.
  7. Documentation: Document the entire process, including rationale, results, and conclusions to present to clients.
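
For step 2, one hedged way to pick the number of components is the cumulative explained-variance curve; X_scaled below is an assumed standardized feature matrix, not yet defined in the notebook:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_scaled)                              # fit with all components
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90)) + 1     # smallest count covering 90%
print(f"{n_components} components explain {cum_var[n_components - 1]:.1%} of the variance")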

Problem Summary: The business needs a better understanding of its customer base to improve marketing ROI. Currently, there's a lack of customer segmentation, leading to inefficient and potentially ineffective marketing campaigns. Targeted marketing efforts are hampered by an inability to identify distinct customer groups with varying needs and preferences.

Proposed Solution Design Rationale: Unsupervised machine learning (specifically dimensionality reduction and clustering) will be used to segment customers based on their observed characteristics and behaviors. This approach allows the identification of inherent groupings within the customer data without pre-defined labels, thus uncovering hidden patterns. Dimensionality reduction techniques like PCA will simplify the data, while clustering algorithms (like K-means or hierarchical clustering) will group customers with similar characteristics into segments.
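
As a sketch of the hierarchical alternative mentioned above (Ward linkage on an assumed reduced matrix X_pca; the 4-segment cut is illustrative, not a decision from the analysis):

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

Z = sch.linkage(X_pca, method='ward')                   # build the merge tree
sch.dendrogram(Z, truncate_mode='level', p=5)           # inspect candidate cut heights
plt.show()
labels_hc = sch.fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 segments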

Business Impact: Effective customer segmentation will lead to improved marketing ROI by enabling targeted campaigns. By tailoring marketing messages and offers to specific customer segments, the business can expect higher conversion rates and increased customer engagement. The identification of distinct customer profiles will allow for more personalized experiences, leading to greater customer satisfaction and loyalty. Ultimately, the project aims to drive revenue growth by optimizing marketing spend and improving overall customer relationship management.

Recommendations for Implementation: Customer Segmentation Project¶

Key Recommendations:

  1. Prioritize Data Quality: Focus on thorough data cleaning and preparation. This includes robust outlier detection and handling, comprehensive missing value imputation strategies (justify the methods used), and thoughtful feature engineering aligned with business objectives and marketing goals. The success of the model heavily depends on the quality of the input data.
  2. Feature Selection Rigor: Carefully select relevant features for segmentation. Avoid using irrelevant or redundant features, which can negatively impact the performance of the dimensionality reduction and clustering algorithms. Employ feature importance analysis techniques to guide selection. Document rationale for inclusion/exclusion of each feature.
  3. Clustering Algorithm Validation: Experiment with various clustering algorithms (K-means, hierarchical, DBSCAN, etc.) and rigorously compare their performance. Evaluate using silhouette scores and visual representations (a comparison sketch follows this list). Consider the interpretability of each algorithm's output.
  4. Segment Profiling Depth: Go beyond basic descriptive statistics when profiling segments. Identify actionable insights within each segment. Visualize the distributions of key variables within each segment to highlight differentiators. Quantify the impact of segment differences on marketing ROI.
  5. Iterative Refinement: Embrace an iterative approach. Continuously evaluate the model's performance using relevant business metrics (e.g., conversion rates, customer lifetime value) and refine the model (parameters and features) based on real-world feedback and evolving business needs.
  6. Robust Documentation: Document every step of the process, including data preprocessing techniques, algorithm choices, hyperparameter settings, evaluation metrics, segment profiles, and actionable insights. This will be critical for communicating the results and their implications to stakeholders.
  7. Data Processing: Encode and scale the data so that it is interpretable by the various dimensionality reduction and clustering models.
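
A hedged sketch of the algorithm comparison in recommendation 3, again on an assumed reduced matrix X_pca (the hyperparameters are placeholders to tune, not recommendations):

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

candidates = {
    'KMeans': KMeans(n_clusters=4, n_init=10, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=4),
    'DBSCAN': DBSCAN(eps=0.8, min_samples=10),  # eps/min_samples are data-dependent
}
for name, model in candidates.items():
    labels = model.fit_predict(X_pca)
    if len(set(labels)) > 1:                    # DBSCAN can collapse to one cluster
        print(name, round(silhouette_score(X_pca, labels), 3))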

Key Actionables for Stakeholders:

  • Marketing Team: Provide feedback on segment profiles and validate insights against their domain expertise. Participate in the development of targeted marketing strategies.
  • Sales Team: Utilize the segment profiles to tailor sales approaches and personalize customer interactions.
  • Executive Leadership: Review the business impact analysis and approve budget for implementation of the recommendations.

Expected Benefits and Costs:

  • Benefits:
    • Increased Marketing ROI: Improved targeting leading to higher conversion rates (assume a 10-15% increase in conversion rates for targeted segments).
    • Improved Customer Engagement: Personalized offers leading to increased customer satisfaction and loyalty (assume a 5-10% increase in customer retention).
    • Enhanced Customer Lifetime Value: Stronger customer relationships lead to increased revenue over the customer's lifetime. (Quantify this using past data if available).
    • Optimized Marketing Spend: More efficient resource allocation due to better targeting.
  • Costs:
    • Data Acquisition/Preparation: Time investment for data cleaning, preprocessing, and feature engineering (estimate hours and associated labor costs).
    • Model Development & Maintenance: Time investment for model building, evaluation, and deployment; ongoing maintenance costs. (Estimate hours and associated labor costs).
    • IT Infrastructure: Potential investment in cloud resources or software licenses (estimate recurring cost).
    • Bad Data: Data may not always be accurate and may mislead decision makers, leading to a loss of sales and revenue.

Rational Assumptions (example):

  • A 10% increase in conversion rates for targeted campaigns translates to an additional $X in revenue (estimate X based on current conversion rates and average order value; a worked example follows this list).
  • A 5% increase in customer retention reduces customer acquisition cost by $Y. (Estimate Y based on historical data).
  • Each data scientist spends 'Z' hours per week on the project over 'N' weeks for a labor cost of $A. (Estimate hours, cost per hour, and timeframe).
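
A worked instance of the first assumption, with purely hypothetical placeholder numbers (replace with real business figures):

customers_targeted = 10_000      # hypothetical campaign reach
baseline_conversion = 0.02       # hypothetical current conversion rate
relative_uplift = 0.10           # the assumed 10% increase
avg_order_value = 50.0           # hypothetical average order value in USD

extra_orders = customers_targeted * baseline_conversion * relative_uplift
extra_revenue = extra_orders * avg_order_value   # this is the "$X" above
print(f"Estimated additional revenue: ${extra_revenue:,.2f}")  # $1,000.00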

Potential Risks and Challenges:

  • Data quality issues: Inaccurate or incomplete data could lead to unreliable segments.
  • Model complexity: Overly complex models might be difficult to interpret or deploy.
  • Resistance to change: Stakeholders might be hesitant to adopt new strategies.
  • Interpretability challenges: Difficulty in explaining the logic behind the segments to stakeholders.

Further Analysis:

  • Competitor analysis: Analyze competitor segmentation strategies.
  • Longitudinal analysis: Track segment performance over time to assess the stability of the segments and adapt strategies.
  • A/B testing: Rigorously test marketing campaigns targeted at different segments (a test-evaluation sketch follows this list).
  • Causality vs. Correlation: While the model identifies patterns, further analysis needs to verify the causal relationship between segmentation and business outcomes. Consider potential confounding factors.
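
For the A/B testing item, a minimal evaluation sketch using a two-proportion z-test (the counts are hypothetical placeholders):

from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 90]    # hypothetical conversions in variants A and B
exposures = [1000, 1000]   # hypothetical customers exposed per variant
stat, p_value = proportions_ztest(conversions, exposures)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # a small p suggests a real difference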

Note to readers: This detailed implementation plan should help guide the project to successful completion and provide measurable business value. Remember to adjust assumptions and quantifiable estimations based on specific business circumstances.

Capstone Project¶

Customer Segmentation¶

Context¶

Customer segmentation is the process of dividing a dataset of customers into groups of similar customers based on certain common characteristics, usually for the purpose of understanding the population dataset in a better fashion. Understanding customer behavior and characteristics is usually a critical part of the marketing operations of any business or organization, with direct consequences on sales & marketing strategy. Customer segmentation is often viewed as a means to achieve a better return on investment from marketing efforts, and make organizations more efficient in terms of utilizing their money, time, and other critical resources in custom marketing strategies for different groups of customers based on their unique needs and motivations.

For example, research has repeatedly shown that customer segmentation often has a huge impact on people's email engagement. Segmented campaigns often see over 100% more clicks than non-segmented campaigns, and email marketers who have segmented their audience before campaigning have reported a 6-7 times growth in their overall revenue. It has also been observed in various contexts that today's customers prefer personalized communications and offerings that cater to their particular interests.

In the context of marketing analytics, then, customer segmentation has a vital role to play in optimizing ROI. It typically involves analyzing metrics around customer engagement with various marketing activities, including, but not limited to, ATL (above the line) marketing activities, BTL (below the line) campaigns, and targeted personalized offers. Typically, the variables of interest are customer profiles, campaign conversion rates, and information associated with various marketing channels. Based on these feature categories, the target is to create the best possible customer segments from the given data.

Objective¶

Using Unsupervised Learning ideas such as Dimensionality Reduction and Clustering, the objective is to come up with the best possible customer segments using the given customer dataset.

Marketing Campaign Analysis¶

Problem Definition¶

The Context:¶

  • Why is this problem important to solve?

Effective customer segmentation is crucial for optimizing marketing ROI and improving customer engagement. Businesses invest significant resources in marketing campaigns, and without understanding customer behavior and preferences, these efforts may be inefficient or ineffective. Personalized marketing and targeted offers based on customer segments can lead to higher conversion rates, increased revenue, and stronger customer relationships. Furthermore, understanding different customer segments can inform product development, pricing strategies, and overall business strategy. Failing to segment customers risks wasting marketing budget and missing opportunities to tailor offerings to specific customer needs, leading to a competitive disadvantage.

The objective:¶

  • What is the intended goal?

The primary goal is to develop a robust customer segmentation model using unsupervised learning techniques. This model should effectively group customers based on their characteristics and interactions with marketing campaigns, ultimately enabling more effective and efficient marketing strategies. The model's success will be measured by its ability to identify distinct customer segments with clear behavioral patterns, leading to actionable insights for personalized offers and targeted campaigns that improve key marketing metrics like ROI and customer engagement.

The key questions:¶

  • What are the key questions that need to be answered?
  • How many distinct customer segments exist within the dataset?
  • What are the key characteristics and behaviors that define each segment?
  • How do these segments differ in their responses to marketing campaigns (e.g., conversion rates, click-through rates, engagement levels)?
  • What are the optimal marketing strategies (messaging, channel selection, offer types) for each identified segment?
  • What is the potential impact of segment-specific marketing strategies on key business metrics (e.g., revenue, customer lifetime value, customer retention)?
  • What are the most relevant and least relevant metrics to measure customer engagement?
  • How can we best visualize and interpret the segments, making them useful for business users? (A visualization sketch follows this list.)
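
On the visualization question, one hedged option is a 2-D t-SNE projection colored by cluster label; X_scaled and labels are assumed placeholders produced earlier in the pipeline:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=10)
plt.title('Customer segments in t-SNE space')
plt.show()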

The problem formulation:¶

  • What is it that we are trying to solve using data science?

We are using unsupervised learning, specifically dimensionality reduction (e.g., PCA) and clustering (e.g., K-means, hierarchical clustering), to identify optimal customer segments from a given dataset. The dataset likely contains customer demographics, purchase history, and interactions with various marketing channels and campaigns. The model will aim to identify underlying patterns and relationships among these features to group customers into homogeneous segments. We'll evaluate the model using metrics like the silhouette score and Davies-Bouldin index, with a primary focus on generating actionable insights that enhance marketing ROI and customer engagement. The challenge lies in finding the right balance between the number of segments, the homogeneity of each segment, and the interpretability of segment profiles in the context of marketing strategy.
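
The "right number of segments" trade-off described here can be scanned directly; a sketch assuming a reduced matrix X_pca:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca)
    sil = silhouette_score(X_pca, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.0f}, silhouette={sil:.3f}")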


Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruit products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2025.
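
The 2025 note and the dictionary above suggest some derived features; a sketch assuming pandas imported as pd and the data already loaded into df (as done in the next section):

df['Age'] = 2025 - df['Year_Birth']                     # per the 2025 collection note
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts',
              'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['TotalSpend'] = df[spend_cols].sum(axis=1)           # total 2-year spend
df['TenureDays'] = (pd.Timestamp('2025-01-01')          # assumed reference date
                    - pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')).dt.days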

Import the necessary libraries and load the data¶

In [1]:
# Uninstall the specified packages
# !pip uninstall -y numpy pandas scikit-learn sklearn_extra

# Reinstall the specified packages, ensuring the correct order
!pip install numpy==1.24.3
Requirement already satisfied: numpy==1.24.3 in /usr/local/lib/python3.11/dist-packages (1.24.3)
In [2]:
import numpy as np  # Import NumPy first
print(f"NumPy version: {np.__version__}")
NumPy version: 1.24.3
In [3]:
!pip install pandas==2.0.3
Requirement already satisfied: pandas==2.0.3 in /usr/local/lib/python3.11/dist-packages (2.0.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.11/dist-packages (from pandas==2.0.3) (1.24.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
In [4]:
!pip install scikit-learn
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (1.6.1)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.24.3)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.14.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (3.6.0)
In [5]:
# Install a compatible version of sklearn_extra
!pip install scikit-learn-extra==0.3.0
Requirement already satisfied: scikit-learn-extra==0.3.0 in /usr/local/lib/python3.11/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.24.3)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.14.1)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra==0.3.0) (1.6.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra==0.3.0) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra==0.3.0) (3.6.0)
In [6]:
# import sklearn_extra  # Import sklearn_extra for its version
# import sklearn # Import sklearn for its version
import pandas as pd
import sklearn
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import cophenet

# Check installed versions
print(f"Scikit-learn version: {sklearn.__version__}")  # Now sklearn is defined
# print(f"sklearn_extra version: {sklearn_extra.__version__}")  # Now sklearn_extra is defined
Scikit-learn version: 1.6.1
In [7]:
import numpy as np
# np.import_array() # Force NumPy to import its array functions explicitly
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

from scipy.cluster.hierarchy import fcluster  # Import fcluster
from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings('ignore')

#loading the data
drive.mount('/content/drive')
# reading in the data and getting the overview
df_temp = pd.read_csv('/content/drive/MyDrive/marketing_campaign.csv')
df = df_temp.copy() # so as not to change the original data
Mounted at /content/drive

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Dropping the ID column, a unique identifier with no modeling value
In [8]:
df.head()
Out[8]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [9]:
df.shape
Out[9]:
(2240, 27)
In [10]:
# Checking the data types
print(df.dtypes)
df.info()
ID                       int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
Kidhome                  int64
Teenhome                 int64
Dt_Customer             object
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Response                 int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [11]:
df.isnull().sum()
Out[11]:
0
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Response 0

In [12]:
df.duplicated().sum()
Out[12]:
0
In [13]:
# dropping ID column
df = df.drop('ID', axis=1)
print("Column 'ID' dropped successfully.")
Column 'ID' dropped successfully.

Observations and Insights from the Data overview:¶

  • The dataset has 2240 rows and 27 columns (26 after the ID column was dropped).
  • The dataset consists of string columns, integer columns, and one float column (Income).
  • Income is the only column with missing values (NaN).
  • No duplicates.
  • Dropped the ID column because it is a unique identifier with no relevance to any model.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables.
  2. Find out the number of unique observations in each category of the categorical columns. Write down your findings/observations/insights.
  3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
In [14]:
# Summary statistics for numerical variables
print(df.describe().T)

print("\n\n")
# Summary statistics for categorical variables
print(df.describe(include=['object']).T)
                      count          mean           std     min       25%  \
Year_Birth           2240.0   1968.805804     11.984069  1893.0   1959.00   
Income               2216.0  52247.251354  25173.076661  1730.0  35303.00   
Kidhome              2240.0      0.444196      0.538398     0.0      0.00   
Teenhome             2240.0      0.506250      0.544538     0.0      0.00   
Recency              2240.0     49.109375     28.962453     0.0     24.00   
MntWines             2240.0    303.935714    336.597393     0.0     23.75   
MntFruits            2240.0     26.302232     39.773434     0.0      1.00   
MntMeatProducts      2240.0    166.950000    225.715373     0.0     16.00   
MntFishProducts      2240.0     37.525446     54.628979     0.0      3.00   
MntSweetProducts     2240.0     27.062946     41.280498     0.0      1.00   
MntGoldProds         2240.0     44.021875     52.167439     0.0      9.00   
NumDealsPurchases    2240.0      2.325000      1.932238     0.0      1.00   
NumWebPurchases      2240.0      4.084821      2.778714     0.0      2.00   
NumCatalogPurchases  2240.0      2.662054      2.923101     0.0      0.00   
NumStorePurchases    2240.0      5.790179      3.250958     0.0      3.00   
NumWebVisitsMonth    2240.0      5.316518      2.426645     0.0      3.00   
AcceptedCmp3         2240.0      0.072768      0.259813     0.0      0.00   
AcceptedCmp4         2240.0      0.074554      0.262728     0.0      0.00   
AcceptedCmp5         2240.0      0.072768      0.259813     0.0      0.00   
AcceptedCmp1         2240.0      0.064286      0.245316     0.0      0.00   
AcceptedCmp2         2240.0      0.012946      0.113069     0.0      0.00   
Complain             2240.0      0.009375      0.096391     0.0      0.00   
Response             2240.0      0.149107      0.356274     0.0      0.00   

                         50%       75%       max  
Year_Birth            1970.0   1977.00    1996.0  
Income               51381.5  68522.00  666666.0  
Kidhome                  0.0      1.00       2.0  
Teenhome                 0.0      1.00       2.0  
Recency                 49.0     74.00      99.0  
MntWines               173.5    504.25    1493.0  
MntFruits                8.0     33.00     199.0  
MntMeatProducts         67.0    232.00    1725.0  
MntFishProducts         12.0     50.00     259.0  
MntSweetProducts         8.0     33.00     263.0  
MntGoldProds            24.0     56.00     362.0  
NumDealsPurchases        2.0      3.00      15.0  
NumWebPurchases          4.0      6.00      27.0  
NumCatalogPurchases      2.0      4.00      28.0  
NumStorePurchases        5.0      8.00      13.0  
NumWebVisitsMonth        6.0      7.00      20.0  
AcceptedCmp3             0.0      0.00       1.0  
AcceptedCmp4             0.0      0.00       1.0  
AcceptedCmp5             0.0      0.00       1.0  
AcceptedCmp1             0.0      0.00       1.0  
AcceptedCmp2             0.0      0.00       1.0  
Complain                 0.0      0.00       1.0  
Response                 0.0      0.00       1.0  



               count unique         top  freq
Education       2240      5  Graduation  1127
Marital_Status  2240      8     Married   864
Dt_Customer     2240    663  31-08-2012    12

Observations:¶

  • 24 rows have a null Income. Solution: remove those rows, since they make up only about 1% of the data.
In [15]:
df = df.dropna()
print("Rows with income at null dropped successfully.")
print(df.describe().T)
Rows with income at null dropped successfully.
                      count          mean           std     min      25%  \
Year_Birth           2216.0   1968.820397     11.985554  1893.0   1959.0   
Income               2216.0  52247.251354  25173.076661  1730.0  35303.0   
Kidhome              2216.0      0.441787      0.536896     0.0      0.0   
Teenhome             2216.0      0.505415      0.544181     0.0      0.0   
Recency              2216.0     49.012635     28.948352     0.0     24.0   
MntWines             2216.0    305.091606    337.327920     0.0     24.0   
MntFruits            2216.0     26.356047     39.793917     0.0      2.0   
MntMeatProducts      2216.0    166.995939    224.283273     0.0     16.0   
MntFishProducts      2216.0     37.637635     54.752082     0.0      3.0   
MntSweetProducts     2216.0     27.028881     41.072046     0.0      1.0   
MntGoldProds         2216.0     43.965253     51.815414     0.0      9.0   
NumDealsPurchases    2216.0      2.323556      1.923716     0.0      1.0   
NumWebPurchases      2216.0      4.085289      2.740951     0.0      2.0   
NumCatalogPurchases  2216.0      2.671029      2.926734     0.0      0.0   
NumStorePurchases    2216.0      5.800993      3.250785     0.0      3.0   
NumWebVisitsMonth    2216.0      5.319043      2.425359     0.0      3.0   
AcceptedCmp3         2216.0      0.073556      0.261106     0.0      0.0   
AcceptedCmp4         2216.0      0.074007      0.261842     0.0      0.0   
AcceptedCmp5         2216.0      0.073105      0.260367     0.0      0.0   
AcceptedCmp1         2216.0      0.064079      0.244950     0.0      0.0   
AcceptedCmp2         2216.0      0.013087      0.113672     0.0      0.0   
Complain             2216.0      0.009477      0.096907     0.0      0.0   
Response             2216.0      0.150271      0.357417     0.0      0.0   

                         50%       75%       max  
Year_Birth            1970.0   1977.00    1996.0  
Income               51381.5  68522.00  666666.0  
Kidhome                  0.0      1.00       2.0  
Teenhome                 0.0      1.00       2.0  
Recency                 49.0     74.00      99.0  
MntWines               174.5    505.00    1493.0  
MntFruits                8.0     33.00     199.0  
MntMeatProducts         68.0    232.25    1725.0  
MntFishProducts         12.0     50.00     259.0  
MntSweetProducts         8.0     33.00     262.0  
MntGoldProds            24.5     56.00     321.0  
NumDealsPurchases        2.0      3.00      15.0  
NumWebPurchases          4.0      6.00      27.0  
NumCatalogPurchases      2.0      4.00      28.0  
NumStorePurchases        5.0      8.00      13.0  
NumWebVisitsMonth        6.0      7.00      20.0  
AcceptedCmp3             0.0      0.00       1.0  
AcceptedCmp4             0.0      0.00       1.0  
AcceptedCmp5             0.0      0.00       1.0  
AcceptedCmp1             0.0      0.00       1.0  
AcceptedCmp2             0.0      0.00       1.0  
Complain                 0.0      0.00       1.0  
Response                 0.0      0.00       1.0  
In [16]:
# Number of unique observations in each category of categorical columns
for col in df.select_dtypes(include=['object']):
    print(f"\nColumn: {col}")
    print(df[col].value_counts())
Column: Education
Education
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: count, dtype: int64

Column: Marital_Status
Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64

Column: Dt_Customer
Dt_Customer
31-08-2012    12
12-09-2012    11
14-02-2013    11
12-05-2014    11
20-08-2013    10
              ..
05-08-2012     1
18-11-2012     1
09-05-2014     1
26-06-2013     1
09-01-2014     1
Name: count, Length: 662, dtype: int64

Observations:¶

  • "Absurd" and "YOLO" are unserious answers. Solution: drop those rows, since they may not hold valid information.
  • "Alone" could plausibly be merged into "Single" (and "Widow"/"Divorced" could arguably be combined). Solution: drop the "Alone" rows too, since there are only 3 of them.
In [17]:
# Dropping all rows where marital status is "Alone", "Absurd", or "YOLO"

marital_status_to_drop = ["Alone", "Absurd", "YOLO"]
df = df[~df["Marital_Status"].isin(marital_status_to_drop)]
print(f"Rows with marital status {marital_status_to_drop} dropped successfully.")

# Analysis of 'Marital_Status' column
print("\nAnalysis of 'Marital_Status' column:")
print(df['Marital_Status'].value_counts())
Rows with marital status ['Alone', 'Absurd', 'YOLO'] dropped successfully.

Analysis of 'Marital_Status' column:
Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Name: count, dtype: int64
In [18]:
# Replacing all occurrences of "2n Cycle" with "Master" in the "Education" column
df['Education'] = df['Education'].replace('2n Cycle', 'Master')

# Analysis of 'Education' column
print("\nAnalysis of 'Education' column:")
print(df['Education'].value_counts())
Analysis of 'Education' column:
Education
Graduation    1114
Master         563
PhD            478
Basic           54
Name: count, dtype: int64
In [19]:
df.shape
Out[19]:
(2209, 26)

Observations:¶

  • "Basic" might mean high school and "Graduation" might mean a bachelor's degree, or vice versa. It is hard to say exactly what those categories represent, so they are left as is.
  • From 2240 rows down to 2209 after removing missing or bad data.

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histograms and box plots for the different numerical features and understand what the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot show some extreme values on the right side of the distribution of the 'Income' feature. Can we consider them outliers and remove them, or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile does the upper whisker lie?
In [20]:
# Plot histogram and box plot for numerical features
numerical_features = ['Year_Birth', 'Income', 'Recency', 'MntFishProducts', 'MntMeatProducts', 'MntFruits',
                      'MntSweetProducts', 'MntWines', 'MntGoldProds', 'NumDealsPurchases', 'NumCatalogPurchases',
                      'NumStorePurchases', 'NumWebPurchases', 'NumWebVisitsMonth']

for feature in numerical_features:
    plt.figure(figsize=(12, 4))

    #histogram and boxplot side by side
    plt.subplot(1, 2, 1)
    sns.histplot(df[feature], kde=True)
    plt.title(f'Histogram of {feature}')

    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[feature])
    plt.title(f'Boxplot of {feature}')

    plt.tight_layout()
    plt.show()
[Figures: histogram and box plot pairs for each of the 14 numerical features listed above]

Observations:¶

  • Number of Web Visits and Number of Web Purchases per month show similar box plots, which is no surprise.
  • Recency and Number of Store Purchases are the only features without outliers.
  • Recency appears uniformly distributed.
  • Most features are right skewed, which is expected since there are no negative values and most observations crowd just above 0.
In [21]:
# Exploring the categorical variables like Education, Kidhome, Teenhome, Complain, and any others.
categorical_features = ['Education', 'Kidhome', 'Teenhome', 'Complain', 'Marital_Status']

for feature in categorical_features:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=feature, data=df)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
    plt.show()

    print(f"\nValue counts for {feature}:\n{df[feature].value_counts()}\n")
[Figure: count plot of Education]
Value counts for Education:
Education
Graduation    1114
Master         563
PhD            478
Basic           54
Name: count, dtype: int64

[Figure: count plot of Kidhome]
Value counts for Kidhome:
Kidhome
0    1279
1     884
2      46
Name: count, dtype: int64

[Figure: count plot of Teenhome]
Value counts for Teenhome:
Teenhome
0    1144
1    1014
2      51
Name: count, dtype: int64

[Figure: count plot of Complain]
Value counts for Complain:
Complain
0    2188
1      21
Name: count, dtype: int64

[Figure: count plot of Marital_Status]
Value counts for Marital_Status:
Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Name: count, dtype: int64

Observations:¶

  • Education is mostly "Graduation".
  • Very few complaints, which could imply happy customers.
  • Mostly 0 or 1 kid/teen at home.
  • Mostly Together or Married.
In [22]:
# Calculate the upper whisker for the 'Income' variable
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR

# Find the largest income at or below the upper whisker (quantile(1) returns the max)
income_lvl = df['Income'][df['Income'] <= upper_whisker].quantile(1)

print(f"The upper whisker for Income lies at {income_lvl:.2f}.")
The upper whisker for Income lies at 113734.00.
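
Leading question 3 also asks at what percentile the whisker lies, which the cell above does not compute; a short sketch using the upper_whisker value just derived:

pct = (df['Income'] <= upper_whisker).mean() * 100  # share of rows at or below the whisker
print(f"The upper whisker sits at roughly the {pct:.1f}th percentile of Income.")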
In [23]:
# 1. How does the distribution of Income variable vary across the dataset?
# a: it varies fairly evenly except for a few outliers around 150,000.

# 2. The histogram and the box plot show some extreme values on the right side of the distribution of the 'Income' feature. Can we consider them outliers and remove them, or should we analyze these extreme values?
# a: removing one clearly bad value, see below.

# 3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them?
# a: The remaining values are not very extreme, but since there are so few of them we drop them as well (see below).

# Analyze the distribution of 'Income'
plt.figure(figsize=(12, 4))

# Histogram
plt.subplot(1, 2, 1)
sns.histplot(df['Income'], kde=True)
plt.title('Distribution of Income')

# Box plot
plt.subplot(1, 2, 2)
sns.boxplot(y=df['Income'])
plt.title('Boxplot of Income')

plt.tight_layout()
plt.show()

# Further exploration and analysis of extreme income values
# examine rows with income above a certain threshold
extreme_income_threshold = income_lvl
extreme_income_rows = df[df['Income'] > extreme_income_threshold]
print("\nRows with extremely high incomes:")
extreme_income_rows
[Figure: histogram and box plot of Income]
Rows with extremely high incomes:
Out[23]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
164 1973 PhD Married 157243.0 0 1 01-03-2014 98 20 2 ... 22 0 0 0 0 0 0 0 0 0
617 1976 PhD Together 162397.0 1 1 03-06-2013 31 85 1 ... 0 1 1 0 0 0 0 0 0 0
655 1975 Graduation Divorced 153924.0 0 0 07-02-2014 81 1 1 ... 0 0 0 0 0 0 0 0 0 0
687 1982 PhD Married 160803.0 0 0 04-08-2012 21 55 16 ... 28 1 0 0 0 0 0 0 0 0
1300 1971 Master Together 157733.0 1 0 04-06-2013 37 39 1 ... 0 1 1 0 0 0 0 0 0 0
1653 1977 Graduation Together 157146.0 0 0 29-04-2013 13 1 0 ... 28 0 1 0 0 0 0 0 0 0
2132 1949 PhD Married 156924.0 0 0 29-08-2013 85 2 1 ... 0 0 0 0 0 0 0 0 0 0
2233 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 14 ... 1 3 6 0 0 0 0 0 0 0

8 rows × 26 columns

In [24]:
# There is only one truly extreme value, an income of 666,666. This is likely a bad value, given the coincidental run of sixes, so we will remove that row.
df = df[df['Income'] != 666666.]
print("Row with income at 666,666.0 dropped successfully.")
# The remaining outliers are not very extreme, but there are so few of them that we will drop them too.
df = df[df['Income'] < income_lvl]  # strict <, so the single row exactly at the whisker value is dropped as well
print("Rows with income above the upper whisker dropped successfully.")
Row with income at 666,666.0 dropped successfully.
Rows with income above the upper whisker dropped successfully.
In [25]:
# Continue exploring numerical features with more detailed statistics
for feature in numerical_features:
    # Skewness
    print(f"Skewness of {feature}: {df[feature].skew()}")
    print("-" * 30)
Skewness of Year_Birth: -0.3527245697160486
------------------------------
Skewness of Income: 0.004845021990835766
------------------------------
Skewness of Recency: -0.0030723804940621124
------------------------------
Skewness of MntFishProducts: 1.912855852859832
------------------------------
Skewness of MntMeatProducts: 1.8150293359162661
------------------------------
Skewness of MntFruits: 2.0982547491943238
------------------------------
Skewness of MntSweetProducts: 2.0636987685700747
------------------------------
Skewness of MntWines: 1.1650323432442231
------------------------------
Skewness of MntGoldProds: 1.8328376708426588
------------------------------
Skewness of NumDealsPurchases: 2.3254914645573184
------------------------------
Skewness of NumCatalogPurchases: 1.3664475612507232
------------------------------
Skewness of NumStorePurchases: 0.7089275674031636
------------------------------
Skewness of NumWebPurchases: 0.9956790673284797
------------------------------
Skewness of NumWebVisitsMonth: 0.23431698182902091
------------------------------

Observations:¶

  • Mostly positively/right skewed.
  • Year of birth is strongly negatively skewed: most customers were born relatively recently, with a long left tail of very early birth years. Upon further inspection, that tail looks like bad data.
In [26]:
# Now display year of birth outliers and remove rows with those outliers

# Identify outliers in 'Year_Birth'
Q1_year = df['Year_Birth'].quantile(0.25)
Q3_year = df['Year_Birth'].quantile(0.75)
IQR_year = Q3_year - Q1_year
lower_bound_year = Q1_year - 1.5 * IQR_year
upper_bound_year = Q3_year + 1.5 * IQR_year

outliers_year = df[(df['Year_Birth'] < lower_bound_year) | (df['Year_Birth'] > upper_bound_year)]

print("Year of birth outliers:")
print(outliers_year.T)

# Remove rows with 'Year_Birth' outliers
df = df[(df['Year_Birth'] >= lower_bound_year) & (df['Year_Birth'] <= upper_bound_year)]

print("\nRows with 'Year_Birth' outliers removed.")
df.shape
Year of birth outliers:
                            192         239         339
Year_Birth                 1900        1893        1899
Education                Master      Master         PhD
Marital_Status         Divorced      Single    Together
Income                  36640.0     60182.0     83532.0
Kidhome                       1           0           0
Teenhome                      0           1           0
Dt_Customer          26-09-2013  17-05-2014  26-09-2013
Recency                      99          23          36
MntWines                     15           8         755
MntFruits                     6           0         144
MntMeatProducts               8           5         562
MntFishProducts               7           7         104
MntSweetProducts              4           0          64
MntGoldProds                 25           2         224
NumDealsPurchases             1           1           1
NumWebPurchases               2           1           4
NumCatalogPurchases           1           0           6
NumStorePurchases             2           2           4
NumWebVisitsMonth             5           4           1
AcceptedCmp3                  0           0           0
AcceptedCmp4                  0           0           0
AcceptedCmp5                  0           0           1
AcceptedCmp1                  0           0           0
AcceptedCmp2                  0           0           0
Complain                      1           0           0
Response                      0           0           0

Rows with 'Year_Birth' outliers removed.
Out[26]:
(2197, 26)
In [27]:
# Display rows where 'NumWebPurchases' > 'NumWebVisitsMonth'
flawed_data = df[df['NumWebPurchases'] > df['NumWebVisitsMonth']]
print("Rows where 'NumWebPurchases' > 'NumWebVisitsMonth':")
print(flawed_data.T)

# There are some rows where the number of web purchases exceeds the number of web visits.
# At first glance this seems flawed, but these could be subscriptions or automatic repeat purchases.
Rows where 'NumWebPurchases' > 'NumWebVisitsMonth':
                           0           2           6           12    \
Year_Birth                 1957        1965        1971        1959   
Education            Graduation  Graduation  Graduation  Graduation   
Marital_Status           Single    Together    Divorced    Divorced   
Income                  58138.0     71613.0     55635.0     63033.0   
Kidhome                       0           0           0           0   
Teenhome                      0           0           1           0   
Dt_Customer          04-09-2012  21-08-2013  13-11-2012  15-11-2013   
Recency                      58          26          34          82   
MntWines                    635         426         235         194   
MntFruits                    88          49          65          61   
MntMeatProducts             546         127         164         480   
MntFishProducts             172         111          50         225   
MntSweetProducts             88          21          49         112   
MntGoldProds                 88          42          27          30   
NumDealsPurchases             3           1           4           1   
NumWebPurchases               8           8           7           3   
NumCatalogPurchases          10           2           3           4   
NumStorePurchases             4          10           7           8   
NumWebVisitsMonth             7           4           6           2   
AcceptedCmp3                  0           0           0           0   
AcceptedCmp4                  0           0           0           0   
AcceptedCmp5                  0           0           0           0   
AcceptedCmp1                  0           0           0           0   
AcceptedCmp2                  0           0           0           0   
Complain                      0           0           0           0   
Response                      1           0           0           0   

                           15          18          23          29    \
Year_Birth                 1946        1949        1954        1965   
Education                   PhD      Master         PhD         PhD   
Marital_Status           Single     Married     Married     Married   
Income                  82800.0     76995.0     65324.0     84618.0   
Kidhome                       0           0           0           0   
Teenhome                      0           1           1           0   
Dt_Customer          24-11-2012  28-03-2013  11-01-2014  22-11-2013   
Recency                      23          91           0          96   
MntWines                   1006        1012         384         684   
MntFruits                    22          80           0         100   
MntMeatProducts             115         498         102         801   
MntFishProducts              59           0          21          21   
MntSweetProducts             68          16          32          66   
MntGoldProds                 45         176           5           0   
NumDealsPurchases             1           2           3           1   
NumWebPurchases               7          11           6           6   
NumCatalogPurchases           6           4           2           9   
NumStorePurchases            12           9           9          10   
NumWebVisitsMonth             3           5           4           2   
AcceptedCmp3                  0           0           0           0   
AcceptedCmp4                  0           0           0           0   
AcceptedCmp5                  1           0           0           1   
AcceptedCmp1                  1           1           0           0   
AcceptedCmp2                  0           0           0           0   
Complain                      0           0           0           0   
Response                      1           0           0           0   

                           36          39    ...        2213        2217  \
Year_Birth                 1970        1943  ...        1995        1948   
Education                   PhD         PhD  ...      Master         PhD   
Marital_Status          Married    Divorced  ...      Single       Widow   
Income                  67353.0     48948.0  ...     80617.0     82032.0   
Kidhome                       0           0  ...           0           0   
Teenhome                      1           0  ...           0           0   
Dt_Customer          31-12-2013  01-02-2013  ...  12-10-2012  05-04-2014   
Recency                      37          53  ...          42          54   
MntWines                    702         437  ...         594         332   
MntFruits                    17           8  ...          51         194   
MntMeatProducts             151         206  ...         631         377   
MntFishProducts               0         160  ...          72         149   
MntSweetProducts              8          49  ...          55         125   
MntGoldProds                 35          42  ...          32          57   
NumDealsPurchases             5           2  ...           1           0   
NumWebPurchases               5           7  ...           4           4   
NumCatalogPurchases           6          10  ...           6           6   
NumStorePurchases            12           5  ...           8           7   
NumWebVisitsMonth             2           6  ...           2           1   
AcceptedCmp3                  0           1  ...           0           0   
AcceptedCmp4                  0           0  ...           0           0   
AcceptedCmp5                  0           0  ...           0           0   
AcceptedCmp1                  0           0  ...           0           0   
AcceptedCmp2                  0           0  ...           0           0   
Complain                      0           0  ...           0           0   
Response                      0           1  ...           0           0   

                           2221        2224        2225        2226  \
Year_Birth                 1982        1970        1968        1979   
Education                Master         PhD  Graduation  Graduation   
Marital_Status           Single      Single    Together     Married   
Income                  75777.0     57642.0     58554.0     63777.0   
Kidhome                       0           0           1           1   
Teenhome                      0           1           1           1   
Dt_Customer          04-07-2013  09-01-2014  26-09-2012  24-03-2013   
Recency                      12          24          55          87   
MntWines                    712         580         368         457   
MntFruits                    26           6          24           5   
MntMeatProducts             538          58          68         106   
MntFishProducts              69           8          38          15   
MntSweetProducts             13           0           0          17   
MntGoldProds                 80          27          88          53   
NumDealsPurchases             1           3           6           8   
NumWebPurchases               3           7           8          11   
NumCatalogPurchases           6           6           2           1   
NumStorePurchases            11           6           6           6   
NumWebVisitsMonth             1           4           7           8   
AcceptedCmp3                  0           1           0           0   
AcceptedCmp4                  1           0           0           0   
AcceptedCmp5                  1           0           0           0   
AcceptedCmp1                  0           0           0           0   
AcceptedCmp2                  0           0           0           0   
Complain                      0           0           0           0   
Response                      1           0           0           0   

                           2231        2235        2236        2238  
Year_Birth                 1970        1967        1946        1956  
Education                Master  Graduation         PhD      Master  
Marital_Status           Single     Married    Together    Together  
Income                  44802.0     61223.0     64014.0     69245.0  
Kidhome                       0           0           2           0  
Teenhome                      0           1           1           1  
Dt_Customer          21-08-2012  13-06-2013  10-06-2014  24-01-2014  
Recency                      71          46          56           8  
MntWines                    853         709         406         428  
MntFruits                    10          43           0          30  
MntMeatProducts             143         182          30         214  
MntFishProducts              13          42           0          80  
MntSweetProducts             10         118           0          30  
MntGoldProds                 20         247           8          61  
NumDealsPurchases             2           2           7           2  
NumWebPurchases               9           9           8           6  
NumCatalogPurchases           4           3           2           5  
NumStorePurchases            12           4           5          10  
NumWebVisitsMonth             8           5           7           3  
AcceptedCmp3                  0           0           0           0  
AcceptedCmp4                  0           0           0           0  
AcceptedCmp5                  0           0           0           0  
AcceptedCmp1                  0           0           1           0  
AcceptedCmp2                  0           0           0           0  
Complain                      0           0           0           0  
Response                      0           0           0           0  

[26 rows x 860 columns]

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [28]:
# Analyze the relationship between 'Income' and 'MntWines'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntWines', data=df)
plt.title('Income vs. Amount Spent on Wine')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Wine')
plt.show()

# Analyze the relationship between 'Income' and 'MntMeatProducts'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntMeatProducts', data=df)
plt.title('Income vs. Amount Spent on Meat Products')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Meat Products')
plt.show()

# Analyze the relationship between 'Income' and 'MntFruits'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Income', y='MntFruits', data=df)
plt.title('Income vs. Amount Spent on Fruits')
plt.xlabel('Income')
plt.ylabel('Amount Spent on Fruits')
plt.show()


# Analyze the relationship between 'Income' and 'NumStorePurchases'
plt.figure(figsize=(8,6))
sns.boxplot(x='Education', y='Income', data=df)
plt.title('Income vs Education')
plt.show()

# Analyze the relationship between categorical variables and 'Response'
for col in ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=col, hue='Response', data=df)
    plt.title(f'Response Rate by {col}')
    plt.xticks(rotation=45)
    plt.show()

# Further bivariate analysis using other suitable plots like pairplot, heatmap etc.
# Example using heatmap for correlation matrix
plt.figure(figsize=(24, 20))
# Calculate the correlation matrix only for numerical features
numerical_df = df.select_dtypes(include=np.number)  # Select only numerical columns
sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
[Figures: scatter plots of Income vs. spending on wine, meat products, and fruits; boxplot of Income by Education; count plots of Response by Education, Marital_Status, Kidhome, and Teenhome; correlation heatmap of numerical features]

Observations from Bivariate Analysis:¶

Income vs. Spending on Products:¶

  • There's a positive correlation between income and the amount spent on wine, meat products, and to a lesser extent, fruits. Higher-income customers tend to spend more on these items. This suggests that income is a significant factor in purchasing behavior for these product categories.
  • The relationship doesn't appear perfectly linear; there's some scatter, indicating other factors might influence spending beyond just income.

Income vs. Education:¶

  • Income distribution varies across different education levels. "PhD" and "Master" levels seem to have higher income distributions, while others overlap more significantly. This suggests an influence of education level on income.
  • "Basic" correlates with substantially less income.

Response Rate by Categorical Variables:¶

  • Education: Response rates seem relatively consistent across education levels, suggesting education level might not be a strong predictor of campaign acceptance.
  • Marital Status: There are subtle differences in response rates among different marital statuses, but no single status overwhelmingly accepts or rejects campaigns. More investigation might be needed.
  • Kidhome/Teenhome: Customers with children or teenagers in their households might demonstrate different response patterns to campaigns. Further investigation into the variations could be worthwhile.

Correlation Matrix:¶

  • The heatmap provides a visual overview of the correlations between numerical features. Strong positive correlations exist between spending on different product categories (e.g., wine, meat, gold products), indicating that customers who spend heavily in one category tend to spend more in others. This suggests that they might be higher value customers overall.
  • Some other correlations are expected, for example, the positive correlation between number of web visits and web purchases makes sense.

Additional insights and potential further analysis:¶

  • Segmentation: The bivariate analysis hints at potential customer segments based on income, education, and spending habits. Further segmentation analysis could reveal distinct customer groups that may respond differently to marketing campaigns.
  • Interaction Effects: There may be potential interaction effects between variables (e.g., the relationship between income and wine spending might be different for various education levels).
  • Outlier Impact: Assess how outlier removal affected the bivariate relationships; some correlations may have shifted as a result.
  • Visualizations: More sophisticated visualizations (e.g., grouped boxplots, violin plots) could reveal additional trends and relationships between the variables.
  • Statistical Tests: Statistical tests could be performed to determine the significance of correlations and differences in response rates between groups (a minimal sketch follows this list).
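
For example, a chi-square test of independence can check whether campaign response varies with a categorical variable. This is a minimal sketch, assuming scipy is available and df is the cleaned DataFrame from above:

# Hedged sketch: chi-square test of independence between Education and Response
from scipy.stats import chi2_contingency

contingency = pd.crosstab(df['Education'], df['Response'])  # observed counts per combination
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p-value={p:.4f}, dof={dof}")

A small p-value would indicate that response rates differ significantly across education levels; the same pattern works for Marital_Status, Kidhome, and Teenhome.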

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out the amount spent per purchase?
In [29]:
# New Total Family size field
df['Tot_Family_Mem'] = 1  # Initialize with 1 for the individual

# Add 1 for spouse if married
df.loc[df['Marital_Status'] == 'Married', 'Tot_Family_Mem'] += 1

# Add the number of kids and teens
df['Tot_Family_Mem'] = df['Tot_Family_Mem'] + df['Kidhome'] + df['Teenhome']
In [30]:
# Calculate customer tenure
# Convert 'Dt_Customer' to datetime objects, specifying the correct format
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')  # dates are day-first, so the explicit format is required

# Calculate the difference between a reference date and the customer's enrollment date
reference_date = datetime.datetime(2025, 1, 1)
df['Customer_Tenure'] = (reference_date - df['Dt_Customer']).dt.days

# Now 'Customer_Tenure' contains the number of days each customer has been with the company
print(df[['Dt_Customer', 'Customer_Tenure']].head(10))
  Dt_Customer  Customer_Tenure
0  2012-09-04             4502
1  2014-03-08             3952
2  2013-08-21             4151
3  2014-02-10             3978
4  2014-01-19             4000
5  2013-09-09             4132
6  2012-11-13             4432
7  2013-05-08             4256
8  2013-06-06             4227
9  2014-03-13             3947
In [31]:
# Calculate the total number of accepted campaigns for each customer
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
In [32]:
# Calculate the age of each customer
current_year = 2025
df['Age'] = current_year - df['Year_Birth']

# Total kids and teens at home
df['Total_Kids'] = df['Kidhome'] + df['Teenhome']

# Total amount spent on products
df['Total_Spending'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

# Total purchases
df['Total_Num_Purchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases']

# Drop rows where exactly one of total purchases / total spending is zero;
# a zero purchase count with non-zero spending would otherwise divide by zero
df = df[~((df['Total_Num_Purchases'] == 0) ^ (df['Total_Spending'] == 0))]

# Amount spent per purchase
df['Amount_Spent_Per_Purchase'] = df['Total_Spending'] / df['Total_Num_Purchases']
df['Amount_Spent_Per_Purchase'] = df['Amount_Spent_Per_Purchase'].fillna(0)  # rows where both totals are zero give 0/0 = NaN


# Display the updated DataFrame
print(df.head(10).T)
                                             0                    1  \
Year_Birth                                1957                 1954   
Education                           Graduation           Graduation   
Marital_Status                          Single               Single   
Income                                 58138.0              46344.0   
Kidhome                                      0                    1   
Teenhome                                     0                    1   
Dt_Customer                2012-09-04 00:00:00  2014-03-08 00:00:00   
Recency                                     58                   38   
MntWines                                   635                   11   
MntFruits                                   88                    1   
MntMeatProducts                            546                    6   
MntFishProducts                            172                    2   
MntSweetProducts                            88                    1   
MntGoldProds                                88                    6   
NumDealsPurchases                            3                    2   
NumWebPurchases                              8                    1   
NumCatalogPurchases                         10                    1   
NumStorePurchases                            4                    2   
NumWebVisitsMonth                            7                    5   
AcceptedCmp3                                 0                    0   
AcceptedCmp4                                 0                    0   
AcceptedCmp5                                 0                    0   
AcceptedCmp1                                 0                    0   
AcceptedCmp2                                 0                    0   
Complain                                     0                    0   
Response                                     1                    0   
Tot_Family_Mem                               1                    3   
Customer_Tenure                           4502                 3952   
TotalAcceptedCmp                             0                    0   
Age                                         68                   71   
Total_Kids                                   0                    2   
Total_Spending                            1617                   27   
Total_Num_Purchases                         22                    4   
Amount_Spent_Per_Purchase                 73.5                 6.75   

                                             2                    3  \
Year_Birth                                1965                 1984   
Education                           Graduation           Graduation   
Marital_Status                        Together             Together   
Income                                 71613.0              26646.0   
Kidhome                                      0                    1   
Teenhome                                     0                    0   
Dt_Customer                2013-08-21 00:00:00  2014-02-10 00:00:00   
Recency                                     26                   26   
MntWines                                   426                   11   
MntFruits                                   49                    4   
MntMeatProducts                            127                   20   
MntFishProducts                            111                   10   
MntSweetProducts                            21                    3   
MntGoldProds                                42                    5   
NumDealsPurchases                            1                    2   
NumWebPurchases                              8                    2   
NumCatalogPurchases                          2                    0   
NumStorePurchases                           10                    4   
NumWebVisitsMonth                            4                    6   
AcceptedCmp3                                 0                    0   
AcceptedCmp4                                 0                    0   
AcceptedCmp5                                 0                    0   
AcceptedCmp1                                 0                    0   
AcceptedCmp2                                 0                    0   
Complain                                     0                    0   
Response                                     0                    0   
Tot_Family_Mem                               1                    2   
Customer_Tenure                           4151                 3978   
TotalAcceptedCmp                             0                    0   
Age                                         60                   41   
Total_Kids                                   0                    1   
Total_Spending                             776                   53   
Total_Num_Purchases                         20                    6   
Amount_Spent_Per_Purchase                 38.8             8.833333   

                                             4                    5  \
Year_Birth                                1981                 1967   
Education                                  PhD               Master   
Marital_Status                         Married             Together   
Income                                 58293.0              62513.0   
Kidhome                                      1                    0   
Teenhome                                     0                    1   
Dt_Customer                2014-01-19 00:00:00  2013-09-09 00:00:00   
Recency                                     94                   16   
MntWines                                   173                  520   
MntFruits                                   43                   42   
MntMeatProducts                            118                   98   
MntFishProducts                             46                    0   
MntSweetProducts                            27                   42   
MntGoldProds                                15                   14   
NumDealsPurchases                            5                    2   
NumWebPurchases                              5                    6   
NumCatalogPurchases                          3                    4   
NumStorePurchases                            6                   10   
NumWebVisitsMonth                            5                    6   
AcceptedCmp3                                 0                    0   
AcceptedCmp4                                 0                    0   
AcceptedCmp5                                 0                    0   
AcceptedCmp1                                 0                    0   
AcceptedCmp2                                 0                    0   
Complain                                     0                    0   
Response                                     0                    0   
Tot_Family_Mem                               3                    2   
Customer_Tenure                           4000                 4132   
TotalAcceptedCmp                             0                    0   
Age                                         44                   58   
Total_Kids                                   1                    1   
Total_Spending                             422                  716   
Total_Num_Purchases                         14                   20   
Amount_Spent_Per_Purchase            30.142857                 35.8   

                                             6                    7  \
Year_Birth                                1971                 1985   
Education                           Graduation                  PhD   
Marital_Status                        Divorced              Married   
Income                                 55635.0              33454.0   
Kidhome                                      0                    1   
Teenhome                                     1                    0   
Dt_Customer                2012-11-13 00:00:00  2013-05-08 00:00:00   
Recency                                     34                   32   
MntWines                                   235                   76   
MntFruits                                   65                   10   
MntMeatProducts                            164                   56   
MntFishProducts                             50                    3   
MntSweetProducts                            49                    1   
MntGoldProds                                27                   23   
NumDealsPurchases                            4                    2   
NumWebPurchases                              7                    4   
NumCatalogPurchases                          3                    0   
NumStorePurchases                            7                    4   
NumWebVisitsMonth                            6                    8   
AcceptedCmp3                                 0                    0   
AcceptedCmp4                                 0                    0   
AcceptedCmp5                                 0                    0   
AcceptedCmp1                                 0                    0   
AcceptedCmp2                                 0                    0   
Complain                                     0                    0   
Response                                     0                    0   
Tot_Family_Mem                               2                    3   
Customer_Tenure                           4432                 4256   
TotalAcceptedCmp                             0                    0   
Age                                         54                   40   
Total_Kids                                   1                    1   
Total_Spending                             590                  169   
Total_Num_Purchases                         17                    8   
Amount_Spent_Per_Purchase            34.705882               21.125   

                                             8                    9  
Year_Birth                                1974                 1950  
Education                                  PhD                  PhD  
Marital_Status                        Together             Together  
Income                                 30351.0               5648.0  
Kidhome                                      1                    1  
Teenhome                                     0                    1  
Dt_Customer                2013-06-06 00:00:00  2014-03-13 00:00:00  
Recency                                     19                   68  
MntWines                                    14                   28  
MntFruits                                    0                    0  
MntMeatProducts                             24                    6  
MntFishProducts                              3                    1  
MntSweetProducts                             3                    1  
MntGoldProds                                 2                   13  
NumDealsPurchases                            1                    1  
NumWebPurchases                              3                    1  
NumCatalogPurchases                          0                    0  
NumStorePurchases                            2                    0  
NumWebVisitsMonth                            9                   20  
AcceptedCmp3                                 0                    1  
AcceptedCmp4                                 0                    0  
AcceptedCmp5                                 0                    0  
AcceptedCmp1                                 0                    0  
AcceptedCmp2                                 0                    0  
Complain                                     0                    0  
Response                                     1                    0  
Tot_Family_Mem                               2                    3  
Customer_Tenure                           4227                 3947  
TotalAcceptedCmp                             0                    1  
Age                                         51                   75  
Total_Kids                                   1                    2  
Total_Spending                              46                   49  
Total_Num_Purchases                          5                    1  
Amount_Spent_Per_Purchase                  9.2                 49.0  
In [33]:
# Create a new binary column 'Single'
df['Single'] = 0

# Set the value to 1 if Marital_Status is 'Single', 'Divorced', or 'Widow'
df.loc[df['Marital_Status'].isin(['Single', 'Divorced', 'Widow']), 'Single'] = 1

# Display the updated DataFrame with the new column
print(df[['Marital_Status', 'Single']].head(20))
   Marital_Status  Single
0          Single       1
1          Single       1
2        Together       0
3        Together       0
4         Married       0
5        Together       0
6        Divorced       1
7         Married       0
8        Together       0
9        Together       0
11        Married       0
12       Divorced       1
13       Divorced       1
14        Married       0
15         Single       1
16        Married       0
17       Together       0
18        Married       0
19         Single       1
20        Married       0
In [34]:
df["Well_Educated"] = (df["Education"] != "Basic").astype(int)
print(df[['Well_Educated',"Education"]].head(20))
    Well_Educated   Education
0               1  Graduation
1               1  Graduation
2               1  Graduation
3               1  Graduation
4               1         PhD
5               1      Master
6               1  Graduation
7               1         PhD
8               1         PhD
9               1         PhD
11              0       Basic
12              1  Graduation
13              1      Master
14              1  Graduation
15              1         PhD
16              1  Graduation
17              1  Graduation
18              1      Master
19              1      Master
20              1  Graduation

Important Insights from EDA and Data Preprocessing¶

What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

Data Cleaning:¶

  • Missing 'Income' values were dropped due to their low count and the difficulty of accurate imputation.
  • Inconsistent or nonsensical entries in 'Marital_Status' (e.g., "Alone", "Absurd", "YOLO") were removed as they represented a small portion of the data and likely held no meaningful information.
  • The '2n Cycle' value in 'Education' was replaced with 'Master' for consistency.
  • Extreme outliers were observed for 'Income'; some were dropped given their small number.
  • Outliers in 'Year_Birth' were identified and removed, potentially due to data entry errors.
  • Rows where the number of Web Purchases was greater than Web Visits were retained, as these could be explained by auto purchases or subscriptions.

Data Distribution:¶

  • Most numerical features exhibited positive skew, common for non-negative values with a concentration near zero.
  • 'Recency' had a somewhat uniform distribution.
  • 'Year_Birth' showed a strong negative skew and was cleaned to remove outliers.

Feature Relationships:¶

  • Positive correlation was found between income and the amount spent on wines, meat products, and fruits.
  • 'Education' level seemed to influence 'Income' distribution, with higher income levels associated with Master and PhD degrees and much lower income level associated with Basic education.
  • The number of web visits and purchases correlated strongly.
  • Correlation between spending on different product categories was observed (suggesting high-value customers).
  • Response rates showed subtle variations across different categories like Marital Status and presence of kids/teens.

Feature Engineering:¶

  • New features were created, enhancing the data's analytical value:
    • 'Tot_Family_Mem': Total family members
    • 'Customer_Tenure': Length of time as a customer
    • 'TotalAcceptedCmp': Total number of offers accepted
    • 'Age': Customer age
    • 'Total_Kids': Total number of children and teenagers
    • 'Total_Spending': Total expenditure on all products
    • 'Total_Num_Purchases': Sum of purchases across different channels
    • 'Amount_Spent_Per_Purchase': Spending per purchase.

Overall the data processing involved addressing outliers, handling inconsistent values, and creating useful features to better understand customer behavior. Further investigation could reveal more nuanced insights.

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation matrix after removing the irrelevant variables
  • Scale the Data
In [35]:
df.describe().T
Out[35]:
count mean min 25% 50% 75% max std
Year_Birth 2193.0 1968.901961 1940.0 1959.0 1970.0 1977.0 1996.0 11.700839
Income 2193.0 51676.200182 2447.0 35246.0 51369.0 68316.0 105471.0 20607.121475
Kidhome 2193.0 0.442316 0.0 0.0 0.0 1.0 2.0 0.53736
Teenhome 2193.0 0.507068 0.0 0.0 0.0 1.0 2.0 0.544607
Dt_Customer 2193 2013-07-09 17:47:01.887824896 2012-07-30 00:00:00 2013-01-16 00:00:00 2013-07-08 00:00:00 2013-12-30 00:00:00 2014-06-29 00:00:00 NaN
Recency 2193.0 49.103055 0.0 24.0 49.0 74.0 99.0 28.939992
MntWines 2193.0 306.962152 0.0 24.0 178.0 508.0 1493.0 337.93087
MntFruits 2193.0 26.4601 0.0 2.0 8.0 33.0 199.0 39.81754
MntMeatProducts 2193.0 165.845873 0.0 16.0 68.0 233.0 1725.0 218.031444
MntFishProducts 2193.0 37.75969 0.0 3.0 12.0 50.0 259.0 54.690012
MntSweetProducts 2193.0 27.116279 0.0 1.0 8.0 34.0 198.0 40.896642
MntGoldProds 2193.0 44.033744 0.0 9.0 25.0 57.0 321.0 51.59191
NumDealsPurchases 2193.0 2.305974 0.0 1.0 2.0 3.0 15.0 1.846739
NumWebPurchases 2193.0 4.094391 0.0 2.0 4.0 6.0 25.0 2.688807
NumCatalogPurchases 2193.0 2.651163 0.0 0.0 2.0 4.0 28.0 2.797136
NumStorePurchases 2193.0 5.838577 0.0 3.0 5.0 8.0 13.0 3.237332
NumWebVisitsMonth 2193.0 5.317829 0.0 3.0 6.0 7.0 20.0 2.360668
AcceptedCmp3 2193.0 0.073871 0.0 0.0 0.0 0.0 1.0 0.261621
AcceptedCmp4 2193.0 0.074783 0.0 0.0 0.0 0.0 1.0 0.263102
AcceptedCmp5 2193.0 0.072959 0.0 0.0 0.0 0.0 1.0 0.260129
AcceptedCmp1 2193.0 0.064295 0.0 0.0 0.0 0.0 1.0 0.245335
AcceptedCmp2 2193.0 0.013224 0.0 0.0 0.0 0.0 1.0 0.114258
Complain 2193.0 0.00912 0.0 0.0 0.0 0.0 1.0 0.095084
Response 2193.0 0.150479 0.0 0.0 0.0 0.0 1.0 0.357622
Tot_Family_Mem 2193.0 2.338349 1.0 2.0 2.0 3.0 5.0 0.900868
Customer_Tenure 2193.0 4193.259006 3839.0 4020.0 4195.0 4368.0 4538.0 202.135423
TotalAcceptedCmp 2193.0 0.299134 0.0 0.0 0.0 0.0 4.0 0.67872
Age 2193.0 56.098039 29.0 48.0 55.0 66.0 85.0 11.700839
Total_Kids 2193.0 0.949384 0.0 0.0 1.0 1.0 3.0 0.748899
Total_Spending 2193.0 608.177839 8.0 69.0 397.0 1048.0 2525.0 602.227874
Total_Num_Purchases 2193.0 12.584131 1.0 6.0 12.0 18.0 32.0 7.166577
Amount_Spent_Per_Purchase 2193.0 37.498756 2.666667 13.0 29.846154 49.125 187.666667 30.065162
Single 2193.0 0.352029 0.0 0.0 0.0 1.0 1.0 0.477712
Well_Educated 2193.0 0.975376 0.0 1.0 1.0 1.0 1.0 0.155011
In [36]:
# Define the features to be used for segmentation
segmentation_features = ['Total_Spending', 'Total_Num_Purchases', 'Amount_Spent_Per_Purchase', 'Total_Kids', 'Single', 'Well_Educated', "Income"]

# Create a new DataFrame with only the segmentation features
segmentation_df = df[segmentation_features]


# Plot the correlation matrix for the segmentation features
plt.figure(figsize=(12, 10))
sns.heatmap(segmentation_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Segmentation Features')
plt.show()

# Scale the data using StandardScaler
scaler = StandardScaler()
scaled_segmentation_df = scaler.fit_transform(segmentation_df)

# Convert the scaled data back to a DataFrame
scaled_segmentation_df = pd.DataFrame(scaled_segmentation_df, columns=segmentation_features)

# Display the first few rows of the scaled DataFrame
print(scaled_segmentation_df.head())
[Figure: correlation heatmap of the segmentation features]
   Total_Spending  Total_Num_Purchases  Amount_Spent_Per_Purchase  Total_Kids  \
0        1.675532             1.314158                   1.197714   -1.267996   
1       -0.965266            -1.198074                  -1.022970    1.403201   
2        0.278732             1.035021                   0.043291   -1.267996   
3       -0.922084            -0.918937                  -0.953661    0.067602   
4       -0.309219             0.197611                  -0.244721    0.067602   

     Single  Well_Educated    Income  
0  1.356714       0.158888  0.313643  
1  1.356714       0.158888 -0.258814  
2 -0.737075       0.158888  0.967692  
3 -0.737075       0.158888 -1.214915  
4 -0.737075       0.158888  0.321166  

Applying t-SNE and PCA to visualize the data in two dimensions¶

Applying t-SNE¶

In [37]:
# Apply t-SNE to the scaled data
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(scaled_segmentation_df)

# Create a DataFrame for the t-SNE results
tsne_df = pd.DataFrame(data=tsne_result, columns=['TSNE1', 'TSNE2'])

# Plot the t-SNE results
plt.figure(figsize=(10, 8))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'])
plt.title('t-SNE Visualization of Customer Segments')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.show()
[Figure: t-SNE scatter plot of the customers in two dimensions]

Observation and Insights:

  • A few possible clusters:
    • Top middle
    • Center middle
    • Center right
    • Bottom left
    • Bottom middle
    • Two on the bottom right
  • To check what drives these groupings, the embedding can be colored by a known feature such as Total_Spending (see the sketch below).
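
This is a minimal sketch, assuming tsne_df and the filtered df share row order (which holds here, since the scaled data was built from df after all row drops):

# Hedged sketch: color the t-SNE embedding by Total_Spending to see
# whether the visual groupings track overall spending
plt.figure(figsize=(10, 8))
sc = plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'],
                 c=df['Total_Spending'].values, cmap='viridis', s=10)
plt.colorbar(sc, label='Total_Spending')
plt.title('t-SNE Embedding Colored by Total Spending')
plt.show()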

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

When PCA helps:

  • High dimensionality: The data has many features. PCA reduces dimensionality by finding principal components that capture the most variance in the data. Clustering in a lower-dimensional space is often faster and can reveal clearer cluster structures since noise and irrelevant features are reduced.

  • Correlated features: If the features are highly correlated (as indicated by the correlation heatmap), PCA can decorrelate them. Clustering algorithms can struggle with correlated features, leading to less meaningful clusters.

When PCA might not help (or could hurt):

  • Interpretability: Principal components are linear combinations of the original features and are often harder to interpret. If understanding the meaning of each cluster in terms of the original features is important, PCA might obfuscate the insights. In this case, the original features (spending, purchase counts, income, etc.) are relatively interpretable, while the PCs are not as directly meaningful.

  • Loss of information: PCA discards some information by reducing dimensionality. If the discarded variance contains important clustering information, PCA could harm the cluster quality.

In this specific case:

The notebook applies t-SNE before PCA. t-SNE is a non-linear method intended purely for visualization, so its output should not feed into PCA or into clustering. There is no clear need to apply PCA at all in this analysis: the feature set is already small, and t-SNE's own 2-component embedding is sufficient for visual inspection.

What to do instead:

  1. Try clustering directly on the scaled data (scaled_segmentation_df): This is a good baseline.

  2. If computation time is an issue, then consider PCA: If clustering on the 7 features of the scaled data is slow, reducing dimensions with PCA first is reasonable. Try different numbers of components (e.g., 3, 4, 5) to balance speed against information retention (a variance check is sketched after this list).

  3. Carefully evaluate results: Compare clustering solutions on the original scaled data and the PCA-transformed data. Use metrics like silhouette scores or visual inspection of cluster separations. Choose the solution that produces the most meaningful and stable clusters.
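
A minimal sketch for that variance check, assuming PCA from sklearn.decomposition is already imported as elsewhere in this notebook:

# Hedged sketch: fit PCA with all components and inspect how much
# variance each one retains before committing to a reduction
pca_full = PCA().fit(scaled_segmentation_df)
print("Explained variance ratio:", np.round(pca_full.explained_variance_ratio_, 3))
print("Cumulative:", np.round(np.cumsum(pca_full.explained_variance_ratio_), 3))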

In [38]:
# Apply PCA to the scaled data
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_segmentation_df)

# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])

# Plot the PCA results
plt.figure(figsize=(10, 8))
plt.scatter(pca_df['PC1'], pca_df['PC2'])
plt.title('PCA Visualization of Customer Segments')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
[Figure: PCA scatter plot of the customers in two dimensions]

Observation and Insights:

No clear pattern is detected visually. A line could run roughly through PC2 = 0, angled slightly downward; it looks like 2 broad clusters overlaid with stripes. The stripes are likely an artifact of the discrete input features, which the loading sketch below can help confirm.
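
A minimal sketch for inspecting the loadings of the 2-component pca object fitted above: discrete features such as 'Single', 'Well_Educated', and 'Total_Kids' take only a few values, and strong loadings on them commonly produce striped bands in a PCA scatter.

# Hedged sketch: PCA loadings per original feature
loadings = pd.DataFrame(pca.components_.T,
                        index=segmentation_features,
                        columns=['PC1', 'PC2'])
print(loadings)
print("Explained variance ratio:", pca.explained_variance_ratio_)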

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?
In [39]:
# Determine optimal K value using the elbow method and silhouette score
inertia = []
silhouette_avg = []
K = range(2, 11)  # Test K values from 2 to 10

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_segmentation_df)
    inertia.append(kmeans.inertia_)
    silhouette_avg.append(silhouette_score(scaled_segmentation_df, kmeans.labels_))

# Plot the elbow curve
plt.figure(figsize=(10, 5))
plt.plot(K, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Plot the silhouette scores
plt.figure(figsize=(10, 5))
plt.plot(K, silhouette_avg, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

# Apply K-means with the chosen optimal k
optimal_k = 4

kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(scaled_segmentation_df)
df['Cluster'] = kmeans.labels_
[Figures: elbow curve and silhouette scores for k = 2 to 10]

Observations:

  • Since the silhouette scores are all low, the elbow curve is the more reliable guide here.
  • There is a clear elbow at k=4 (a Davies-Bouldin cross-check is sketched below).
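
As a second opinion on k, the Davies-Bouldin index (lower is better) can be computed over the same range; a minimal sketch, assuming sklearn is available:

# Hedged sketch: Davies-Bouldin index across candidate k values
from sklearn.metrics import davies_bouldin_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42).fit(scaled_segmentation_df)
    print(f"k={k}: Davies-Bouldin={davies_bouldin_score(scaled_segmentation_df, km.labels_):.3f}")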

Applying KMeans on the PCA data and visualize the clusters¶

In [40]:
# Apply K-means to the PCA data
kmeans_pca = KMeans(n_clusters=optimal_k, random_state=42)
kmeans_pca.fit(pca_df)

# Add cluster labels to the PCA DataFrame
pca_df['Cluster'] = kmeans_pca.labels_

# Visualize the clusters on the PCA plot
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k):
    plt.scatter(pca_df[pca_df['Cluster'] == cluster]['PC1'],
                pca_df[pca_df['Cluster'] == cluster]['PC2'],
                label=f'Cluster {cluster}')
plt.title('K-Means Clustering on PCA Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
[Figure: K-Means clusters plotted on the first two principal components]

PCA does not show the clusters very well. Without the colors, the eye would detect only two groupings: the lower-left group (currently green) and the remaining multi-colored mass. The algorithm may be picking up on the stripes, which would explain why it settles on 4 clusters.

Cluster Profiling¶

In [41]:
# Calculate cluster profiles with more features
cluster_profiles = df.groupby('Cluster').agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()

# Get the number of features to plot
num_features = len(cluster_profiles.columns) - 1  # Subtract 1 to exclude 'Cluster' column

# Calculate the number of rows and columns for subplots
num_rows = int(np.ceil(num_features / 3))  # 3 columns per row
num_cols = 3

# Create the figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))  # Adjust figsize as needed

# Flatten the axes array for easier iteration
axes = axes.flatten()

# Iterate through features and create bar plots
for i, feature in enumerate(cluster_profiles.columns[1:]):  # Start from index 1 to exclude 'Cluster'
    sns.barplot(x='Cluster', y=feature, data=cluster_profiles, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster')

# Hide any unused subplots
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
[Figure: bar plots of mean feature values per K-Means cluster]

Observations and Insights: The means of some features, such as Age and Customer_Tenure, are not very distinct between clusters, which indicates that they are not very useful for segmentation, or that a different k value would produce better separation on those values.

Many features, however, differentiate quite strongly between clusters, such as Total_Spending and Income (an ANOVA-based check is sketched below).
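
To quantify which features actually separate the clusters, a one-way ANOVA F-statistic per feature works as a quick screen (a higher F means stronger between-cluster differences). A minimal sketch, using sklearn's f_classif for convenience:

# Hedged sketch: per-feature ANOVA F-statistic against the K-Means cluster labels
from sklearn.feature_selection import f_classif

F, p = f_classif(scaled_segmentation_df, df['Cluster'])
for name, f_val, p_val in zip(segmentation_features, F, p):
    print(f"{name}: F={f_val:.1f}, p={p_val:.2e}")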

Describe the characteristics of each cluster¶

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

Summary of each cluster:

Cluster 0:

  • Lower Total Spending, Income, and Amount Spent Per Purchase.
  • Fewer Total Purchases, indicating less frequent engagement.
  • Slightly higher proportion of single individuals.
  • Lower education levels.
  • Preferentially uses deals/promotions.
  • More likely to complain
  • Lower recency

Cluster 1:

  • Highest Total Spending and Income.
  • High Amount Spent Per Purchase, suggesting larger purchases.
  • Higher number of total purchases, indicating more frequent customer activity.
  • Relatively balanced family composition and education levels.
  • Moderate use of all channels, with a slight preference for web purchases.
  • Less likely to complain
  • Relatively high recency

Cluster 2:

  • Moderate Total Spending and Income.
  • Moderate Amount Spent Per Purchase and Total Purchases.
  • Slightly higher proportion of families.
  • Higher education levels
  • Moderate use of web and catalog channels.
  • Lower recency
  • Less likely to complain

Cluster 3:

  • Lower Total Spending and Income compared to Cluster 1.
  • Moderate Amount Spent Per Purchase.
  • Moderate number of total purchases.
  • Higher proportion of families with kids/teens.
  • Lower proportion of single people.
  • Lower education level
  • Moderate recency
  • Less likely to complain

Possible Next Steps for Deeper Insights:

  1. RFM Analysis: Combine Recency, Frequency, and Monetary Value to create more robust segments (a minimal scoring sketch follows this list).
  2. Channel Preferences: Analyze the preferred purchase channels (web, catalog, store) within each cluster to refine targeting strategies.
  3. Product Affinity: Investigate which product categories are most popular within each cluster, and use this to create targeted offers or promotions.
  4. Campaign Effectiveness: Evaluate the success of past marketing campaigns within each cluster.
  5. Customer Lifetime Value: Estimate the potential lifetime value of customers in each cluster.
  6. External Data Integration: Combine customer data with external sources like demographics and economic indicators.
  7. Qualitative Research: Conduct surveys or interviews with customers from each cluster to gather deeper insights into their motivations and behaviors.
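
A minimal RFM scoring sketch, assuming quartile scores and the engineered columns above ('Recency', 'Total_Num_Purchases', 'Total_Spending'); the rank step avoids duplicate quantile edges on heavily tied counts:

# Hedged sketch: quartile-based RFM scores (4 = best in each dimension)
rfm = df[['Recency', 'Total_Num_Purchases', 'Total_Spending']].copy()
rfm['R'] = pd.qcut(rfm['Recency'].rank(method='first'), q=4, labels=[4, 3, 2, 1]).astype(int)  # low recency is good
rfm['F'] = pd.qcut(rfm['Total_Num_Purchases'].rank(method='first'), q=4, labels=[1, 2, 3, 4]).astype(int)
rfm['M'] = pd.qcut(rfm['Total_Spending'].rank(method='first'), q=4, labels=[1, 2, 3, 4]).astype(int)
rfm['RFM_Score'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
print(rfm.head())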

K-Medoids¶

In [42]:
# Determine optimal K value using the elbow method and silhouette score for K-Medoids
inertia_kmedoids = []
silhouette_avg_kmedoids = []
K = range(2, 11)  # Test K values from 2 to 10

for k in K:
    kmedoids = KMedoids(n_clusters=k, random_state=42)
    kmedoids.fit(scaled_segmentation_df)
    inertia_kmedoids.append(kmedoids.inertia_)
    silhouette_avg_kmedoids.append(silhouette_score(scaled_segmentation_df, kmedoids.labels_))

# Plot the elbow curve for K-Medoids
plt.figure(figsize=(10, 5))
plt.plot(K, inertia_kmedoids, marker='o')
plt.title('Elbow Method for Optimal k (K-Medoids)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Plot the silhouette scores for K-Medoids
plt.figure(figsize=(10, 5))
plt.plot(K, silhouette_avg_kmedoids, marker='o')
plt.title('Silhouette Score for Optimal k (K-Medoids)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

# Apply K-Medoids with the chosen optimal k
optimal_k_kmedoids = 7 # clear elbow at 7

kmedoids = KMedoids(n_clusters=optimal_k_kmedoids, random_state=42)
kmedoids.fit(scaled_segmentation_df)
df['Cluster_KMedoids'] = kmedoids.labels_
[Figures: elbow curve and silhouette scores for K-Medoids, k = 2 to 10]

Visualize the clusters using PCA¶

In [43]:
# Apply K-Medoids with PCA first
pca = PCA(n_components=2, random_state=42)  # Use PCA for dimensionality reduction
pca_result = pca.fit_transform(scaled_segmentation_df)

optimal_k_kmedoids = 7  # Use the previously determined optimal k

kmedoids_pca = KMedoids(n_clusters=optimal_k_kmedoids, random_state=42)
kmedoids_pca.fit(pca_result)

# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k_kmedoids):
    plt.scatter(pca_result[kmedoids_pca.labels_ == cluster, 0],
                pca_result[kmedoids_pca.labels_ == cluster, 1],
                label=f'Cluster {cluster}')
plt.title('K-Medoids Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
[Figure: K-Medoids clusters plotted on the first two principal components]

Cluster Profiling¶

In [44]:
# Calculate cluster profiles with more features for K-Medoids
cluster_profiles_kmedoids = df.groupby('Cluster_KMedoids').agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()

# Get the number of features to plot
num_features = len(cluster_profiles_kmedoids.columns) - 1

# Calculate the number of rows and columns for subplots
num_rows = int(np.ceil(num_features / 3))
num_cols = 3

# Create the figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()

# Iterate through features and create bar plots
for i, feature in enumerate(cluster_profiles_kmedoids.columns[1:]):
    sns.barplot(x='Cluster_KMedoids', y=feature, data=cluster_profiles_kmedoids, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (K-Medoids)')

# Hide any unused subplots
for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
[Figure: bar plots of mean feature values per K-Medoids cluster]

Observations and Insights:

  • K-Medoids Optimal k: The elbow method for K-Medoids suggests an optimal value of k=7, indicated by a clear elbow point in the inertia plot. The silhouette scores also support this, showing relatively better separation at that point.

  • K-Medoids Cluster Visualization (PCA): The PCA visualization of K-Medoids clusters shows some separation between groups, but it's not as distinct as the t-SNE visualizations performed earlier. PCA, by its nature, aims to capture the most variance, not necessarily to optimize for visual separation of clusters. It's important to note that using PCA as a visualization tool may not accurately reflect the true clustering structure, especially when dealing with non-linear relationships in the data. This is again a point in favor of using t-SNE or other similar approaches for visualization if the goal is to view cluster separation graphically.

  • K-Medoids Cluster Profiling: The cluster profiles generated by K-Medoids provide insights into the characteristics of each cluster. By analyzing the mean values of various features across clusters, one can observe the differences between customer segments. For example, one might find one cluster with higher average spending and income, another with frequent small purchases, and others with distinct channel preferences. These cluster means provide evidence for whether the clusters are meaningful and reveal the characteristics associated with them. Very similar to the K-Means results, Customer_Tenure, Age, and Recency do not have much of an effect on the overall result; Well_Educated also has little impact with K-Medoids, though it does with K-Means.

Characteristics of each cluster¶

Summary for each cluster:

Observations and Insights:

K-Medoids Cluster Analysis: Characteristics of 7 Clusters

Cluster 0:

  • Summary: This cluster represents customers with moderate spending, income, and purchase frequency. They exhibit a balanced approach across various purchasing channels and show a moderate level of engagement with the brand. They are moderately likely to complain, but their recency is low.
  • Characteristics: Average spending and income, moderate purchase frequency, balanced channel usage, and likely less engaged with recent promotions or marketing efforts. They show a moderate level of response to marketing.
  • Observations and Insights: This segment represents a typical customer base, and targeted campaigns could focus on increasing engagement and average order value.

Cluster 1:

  • Summary: Customers in this cluster exhibit high total spending and income, along with a high amount spent per purchase, showing a tendency toward larger purchases. They are the highest spending cluster and the most engaged overall. They are the least likely to complain.
  • Characteristics: High-value customers with significant purchasing power, likely loyal to the brand, and responsive to premium offerings. They exhibit a high recency.
  • Observations and Insights: Focus on retention strategies, exclusive offers, and premium services to cater to their high-value nature.

Cluster 2:

  • Summary: This group demonstrates the lowest average spending and income of all clusters. Their frequency of purchases is also low, and they're highly price-sensitive.
  • Characteristics: Lower spending, lower income, low purchase frequency, and a potential inclination towards promotions and deals. Their recency is low. They are less likely to respond to marketing efforts.
  • Observations and Insights: Focus on attracting and retaining these customers with budget-friendly offers and promotions.

Cluster 3:

  • Summary: This cluster represents customers with moderate spending and income, similar to cluster 0, but they show a higher propensity for online purchases.
  • Characteristics: Moderate spending, income, and purchase frequency, but with a strong preference for online channels.
  • Observations and Insights: Target these customers with online-focused promotions, personalized recommendations, and enhanced digital experience.

Cluster 4:

  • Summary: Characterized by relatively low spending and purchase frequency, these customers exhibit a preference for catalog purchases.
  • Characteristics: Low spending and frequency, but a distinctive preference for catalog purchases.
  • Observations and Insights: Catalogs may be a primary communication channel for this group. Further investigation might be worthwhile.

Cluster 5:

  • Summary: This cluster has a balance of characteristics. They are moderately frequent purchasers with a propensity to make higher-value purchases than some other clusters.
  • Characteristics: Moderate spending, moderate-to-high income, relatively high amount spent per purchase, and a balance across purchase channels.
  • Observations and Insights: These are potentially valuable customers, deserving more attention and potentially higher-value offers to encourage continued loyalty and spending.

Cluster 6:

  • Summary: This segment displays a high level of recent engagement. They may be new customers, or those who have recently become more active.
  • Characteristics: High recency but with a wide range of spending and income levels.
  • Observations and Insights: Explore their recent engagement and tailor offers based on individual spending habits and potential. They could be highly receptive to promotional offers.

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [45]:
# Define distance metrics and linkage methods
distance_metrics = ['euclidean', 'cosine', 'cityblock']  # Example distances
linkage_methods = ['ward', 'complete', 'average', 'single'] # Example linkages

for distance_metric in distance_metrics:
    for linkage_method in linkage_methods:
        # Calculate pairwise distances
        distance_matrix = pdist(scaled_segmentation_df, metric=distance_metric)

        # Apply hierarchical clustering
        Z = sch.linkage(distance_matrix, method=linkage_method)

        # Calculate cophenetic correlation
        c, coph_dists = cophenet(Z, distance_matrix)

        print(f"Distance metric: {distance_metric}, Linkage method: {linkage_method}, Cophenetic correlation: {c}")

        # Create dendrogram
        plt.figure(figsize=(10, 7))
        sch.dendrogram(Z)
        plt.title(f"Dendrogram - Distance: {distance_metric}, Linkage: {linkage_method}")
        plt.xlabel("Data Points")
        plt.ylabel("Euclidean Distance")
        plt.show()
Distance metric: euclidean, Linkage method: ward, Cophenetic correlation: 0.6123351772311272
Distance metric: euclidean, Linkage method: complete, Cophenetic correlation: 0.7554965204185778
Distance metric: euclidean, Linkage method: average, Cophenetic correlation: 0.8115901083822973
Distance metric: euclidean, Linkage method: single, Cophenetic correlation: 0.6645719278052516
Distance metric: cosine, Linkage method: ward, Cophenetic correlation: 0.790770108442397
Distance metric: cosine, Linkage method: complete, Cophenetic correlation: 0.6142869145599397
Distance metric: cosine, Linkage method: average, Cophenetic correlation: 0.8228621339320061
Distance metric: cosine, Linkage method: single, Cophenetic correlation: 0.12150149904066151
Distance metric: cityblock, Linkage method: ward, Cophenetic correlation: 0.6609966974251326
Distance metric: cityblock, Linkage method: complete, Cophenetic correlation: 0.6885602896156096
Distance metric: cityblock, Linkage method: average, Cophenetic correlation: 0.7383057177388052
Distance metric: cityblock, Linkage method: single, Cophenetic correlation: 0.5360541365004085
[Figures: a dendrogram for each distance/linkage combination above]

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?

No, deciding the number of clusters solely based on a visual inspection of the dendrogram's horizontal cut can be subjective. While a dendrogram helps visualize the hierarchical relationships between data points, an optimal horizontal cut isn't always obvious. Different cuts will lead to different numbers of clusters. There is no single, universally correct place to cut the dendrogram.

  • What is the next step in obtaining number of clusters based on the dendrogram?

The next step is to use a more rigorous approach than simply visual inspection. Here are some methods:

  1. Inconsistent Method: This method looks at the height of the fusion in the dendrogram and tries to find an appropriate cutoff based on inconsistencies in the heights. This is useful for determining significant clusters.
  2. Height Cutoff: Define a threshold for the distance or height on the dendrogram where one wants to make the cut. This value is arbitrary, but one might base it on domain expertise, or look for large jumps in heights in the dendrogram.
  3. Analyze Cophenetic Correlation: The cophenetic correlation coefficient measures how faithfully the dendrogram preserves the original pairwise distances between data points. A higher cophenetic correlation coefficient suggests a better representation of the data, and might indirectly give one an idea of the number of clusters. However, it's not a direct measure of the optimal number of clusters.
  4. Silhouette Score or other evaluation metrics: After clustering based on cutting the dendrogram, apply traditional methods to assess the number of clusters.

These approaches can provide more objective guidance; a minimal sketch combining the inconsistency statistic with silhouette evaluation follows.
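
This sketch recomputes a linkage explicitly (euclidean distance with average linkage, which scored well on cophenetic correlation above) rather than relying on the loop's leftover Z, and assumes fcluster and silhouette_score are imported as used elsewhere in this notebook:

# Hedged sketch: inconsistency statistics plus silhouette scores for candidate cuts
from scipy.cluster.hierarchy import inconsistent

Z_avg = sch.linkage(pdist(scaled_segmentation_df, metric='euclidean'), method='average')
print(inconsistent(Z_avg)[-5:])  # look for jumps in the final merges

for n_clusters in range(2, 6):
    cut_labels = fcluster(Z_avg, t=n_clusters, criterion='maxclust')
    if len(np.unique(cut_labels)) > 1:
        print(f"{n_clusters} clusters: silhouette={silhouette_score(scaled_segmentation_df, cut_labels):.3f}")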

  • Are there any distinct clusters in any of the dendrograms?
    • Each dendrogram shows distinct tall stems connected near the top by both wide and narrow links.
    • Some dendrograms are less distinct, such as the Complete linkages and some of the Average linkages.

Visualize the clusters using PCA¶

In [46]:
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)

# 'Z' is the linkage matrix left over from the final iteration of the loop above (cityblock distance, single linkage); recompute sch.linkage if a different combination is preferred
# Get cluster labels using fcluster, specifying the desired number of clusters (e.g., 4)
labels = fcluster(Z, t=4, criterion='maxclust')

# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(labels):  # Use np.unique to find unique cluster labels
    plt.scatter(pca_result[labels == cluster, 0], pca_result[labels == cluster, 1], label=f'Cluster {cluster}')

plt.title('Hierarchical Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
[Figure: hierarchical clusters plotted on the first two principal components]

Cluster Profiling¶

In [47]:
# Calculate cluster profiles for hierarchical clustering
cluster_profiles_hierarchical = df.groupby(labels).agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()

# Plotting cluster profiles for hierarchical clustering
num_features = len(cluster_profiles_hierarchical.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3

fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(cluster_profiles_hierarchical.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_hierarchical, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (Hierarchical)')

for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
[Figure: bar plots of mean feature values per hierarchical cluster]

Observations and Insights:

  • Cluster 3 (green) contains most of the data points, followed by cluster 1 (blue); clusters 4 and 2 hold what appear to be only two and one data points, respectively.
  • Feature Importance: The consistent observation across all three clustering methods (K-Means, K-Medoids, Hierarchical) that Customer Tenure, Age, and Recency have minimal impact suggests that these factors might not be strong drivers of customer segmentation in this dataset. This implies that other variables, such as spending habits, purchase frequency, and other preferences, may be more important for defining customer segments.
  • Only cluster 3 has complaints, and likewise it has data points in every column.
  • Hierarchical clustering does not capture the clusters very well here. Ideally clusters 1 and 3 alone would capture the structure; two clusters appear better than four in this case (see the sketch below).
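
As a quick check of the two-cluster reading, a minimal sketch (assuming Z, fcluster, and silhouette_score from the cells above are still in scope):

# Cut the same linkage matrix into 2 clusters instead of 4 and score the result
labels_2 = fcluster(Z, t=2, criterion='maxclust')
print(f"2-cluster silhouette: {silhouette_score(scaled_segmentation_df, labels_2):.3f}")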

Characteristics of each cluster¶

Summary of each cluster:

Characteristics of 4 Clusters (Based on provided visualization and profiling)

Cluster 1:

  • Summary: This cluster likely represents a large portion of the customer base, exhibiting a wide range of characteristics but possibly with a slightly higher than average spending and total purchases. It seems to be the most common profile. They also seem to be most likely to complain.
  • Characteristics: Moderate to high spending, moderate purchase frequency, varied channel usage.

Cluster 2:

  • Summary: This cluster appears to contain very few customers and is relatively distinct.
  • Characteristics: Low frequency of purchases and spending, with the caveat that the small sample size limits drawing strong conclusions.

Cluster 3:

  • Summary: This cluster shows characteristics similar to the largest cluster but with potentially lower spending, total purchases, and a different channel preference. Possibly representing customers who are less engaged or prefer alternative purchase channels.
  • Characteristics: Moderate spending and frequency, possibly with a stronger preference for specific purchase channels.

Cluster 4:

  • Summary: Another small cluster, potentially representing another niche group of customers.
  • Characteristics: Difficult to determine clear characteristics due to the limited number of data points.

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best set of hyperparameters to use with it. It has two hyperparameters: eps and min_samples.

Since it is an unsupervised algorithm, there is no labeled validation set to tune against, unlike in supervised learning. The approach we can follow is to try a grid of different parameter combinations and compute the silhouette score for each of them.
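
One heuristic worth noting before the grid search (an alternative this notebook does not use) is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for an elbow, which suggests a candidate eps. A hedged sketch, assuming the notebook's usual imports (np, plt) and scaled_segmentation_df:

# k-distance heuristic for choosing eps (k is commonly set to min_samples)
from sklearn.neighbors import NearestNeighbors

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(scaled_segmentation_df)
distances, _ = nn.kneighbors(scaled_segmentation_df)  # column 0 is the self-distance (0)
k_distances = np.sort(distances[:, -1])               # distance to the k-th neighbor

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to neighbor {k}')
plt.title('k-distance plot: the elbow suggests a candidate eps')
plt.show()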

In [48]:
# Define the parameter grid to search
eps_values = [0.1, 0.5, 1, 2, 5]  # Example eps values
min_samples_values = [2, 5, 10, 20]  # Example min_samples values


best_score = -1
best_eps = None
best_min_samples = None

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(scaled_segmentation_df)

        # Check if more than one cluster is found
        if len(set(labels)) > 1:  # Ignore cases with only noise or one cluster
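            # Note: DBSCAN labels noise as -1, and silhouette_score treats that
            # noise as one extra cluster, which can bias the score.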
            score = silhouette_score(scaled_segmentation_df, labels)

            if score > best_score:
                best_score = score
                best_eps = eps
                best_min_samples = min_samples

            print(f"eps: {eps}, min_samples: {min_samples}, silhouette score: {score}")


print(f"\nBest parameters: eps={best_eps}, min_samples={best_min_samples}, silhouette score={best_score}")

# Fit DBSCAN with the best hyperparameters
best_dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
best_labels = best_dbscan.fit_predict(scaled_segmentation_df)

# Now one can use 'best_labels' for further analysis, like cluster profiling
eps: 0.1, min_samples: 2, silhouette score: -0.14111668583144424
eps: 0.1, min_samples: 5, silhouette score: -0.4536308807195296
eps: 0.1, min_samples: 10, silhouette score: -0.251366827044772
eps: 0.5, min_samples: 2, silhouette score: 0.052830037298867406
eps: 0.5, min_samples: 5, silhouette score: 0.13500354391650107
eps: 0.5, min_samples: 10, silhouette score: 0.0811168552184625
eps: 0.5, min_samples: 20, silhouette score: -0.009236040212927215
eps: 1, min_samples: 2, silhouette score: 0.11084030466103999
eps: 1, min_samples: 5, silhouette score: 0.14564400885419426
eps: 1, min_samples: 10, silhouette score: 0.10289200663344249
eps: 1, min_samples: 20, silhouette score: 0.13132599165460845
eps: 2, min_samples: 2, silhouette score: 0.24181438952337633
eps: 2, min_samples: 5, silhouette score: 0.24159549640464362
eps: 2, min_samples: 10, silhouette score: 0.24159549640464362
eps: 2, min_samples: 20, silhouette score: 0.24155740743355267
eps: 5, min_samples: 2, silhouette score: 0.5750308184494759
eps: 5, min_samples: 5, silhouette score: 0.5750308184494759
eps: 5, min_samples: 10, silhouette score: 0.5750308184494759
eps: 5, min_samples: 20, silhouette score: 0.5750308184494759

Best parameters: eps=5, min_samples=2, silhouette score=0.5750308184494759

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [49]:
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)

# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(best_labels):
    plt.scatter(pca_result[best_labels == cluster, 0], pca_result[best_labels == cluster, 1], label=f'Cluster {cluster}')

plt.title('DBSCAN Clustering with PCA Visualization (Best Hyperparameters)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
[Figure: DBSCAN clusters (best hyperparameters) plotted in 2-D PCA space]

Observations and Insights:

  • Two clusters seem to be the best overall clustering option. Visually, there are clearly two distinct separations. Even though the elbow test didn't reveal it for K-Means and K-Medoids, their silhouette scores were highest at k=2, and now DBSCAN agrees.
  • Visually there are also about 7 horizontal lines/stripes of data points, so plausibly 7 clusters would work too, albeit they would be difficult to recover programmatically. There are 5 stripes in cluster 0 (blue) and 2 in cluster 1 (orange), but 2 clusters still seems the clearest reading (one way to investigate the stripes is sketched below).
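
One way to investigate those stripes: if a low-cardinality feature dominates the second principal component, its discrete levels show up as parallel bands in the PCA scatter. A hedged sketch, assuming pca is the fitted 2-component PCA from the cell above and scaled_segmentation_df is a DataFrame with named columns:

# Inspect PCA loadings; large |PC2| values flag candidate stripe drivers
loadings = pd.DataFrame(
    pca.components_.T,
    index=scaled_segmentation_df.columns,
    columns=['PC1', 'PC2']
)
print(loadings.reindex(loadings['PC2'].abs().sort_values(ascending=False).index).head())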

Think about it:

  • Will changing the eps and min_samples values result in different DBSCAN results? Can we try more values for eps and min_samples?

No. As shown below, the search still maxes out at a silhouette score of 0.5750..., the same value reached above. The earlier grid already showed silhouette scores trending upward as eps and min_samples increased, and adding intermediate values does not change the maximum: the search eventually reaches that 0.575... score at the top of each parameter interval regardless. A finer grid only finds the best score sooner, as the results below show, while the score itself stays the same.

In [50]:
# Expanding the parameter grid for DBSCAN
eps_values = np.linspace(0.1, 5, 20)  # More granular eps values
min_samples_values = [2, 3, 4, 5, 10, 15, 20]  # More min_samples values

best_score = -1
best_eps = None
best_min_samples = None

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(scaled_segmentation_df)

        # Check if more than one cluster is found
        if len(set(labels)) > 1:  # Ignore cases with only noise or one cluster
            score = silhouette_score(scaled_segmentation_df, labels)

            if score > best_score:
                best_score = score
                best_eps = eps
                best_min_samples = min_samples

            print(f"eps: {eps}, min_samples: {min_samples}, silhouette score: {score}")

print(f"\nBest parameters: eps={best_eps}, min_samples={best_min_samples}, silhouette score={best_score}")
eps: 0.1, min_samples: 2, silhouette score: -0.14111668583144424
eps: 0.1, min_samples: 3, silhouette score: -0.4105862013979167
eps: 0.1, min_samples: 4, silhouette score: -0.40478304072544663
eps: 0.1, min_samples: 5, silhouette score: -0.4536308807195296
eps: 0.1, min_samples: 10, silhouette score: -0.251366827044772
eps: 0.35789473684210527, min_samples: 2, silhouette score: -0.015290880843999769
eps: 0.35789473684210527, min_samples: 3, silhouette score: 0.016117441256415226
eps: 0.35789473684210527, min_samples: 4, silhouette score: 0.045999073919817966
eps: 0.35789473684210527, min_samples: 5, silhouette score: 0.014971351444800638
eps: 0.35789473684210527, min_samples: 10, silhouette score: -0.023981122080325056
eps: 0.35789473684210527, min_samples: 15, silhouette score: -0.06229472779327022
eps: 0.35789473684210527, min_samples: 20, silhouette score: -0.11974086168355594
eps: 0.6157894736842106, min_samples: 2, silhouette score: 0.02438652516747386
eps: 0.6157894736842106, min_samples: 3, silhouette score: 0.07509040963529537
eps: 0.6157894736842106, min_samples: 4, silhouette score: 0.0756178083308249
eps: 0.6157894736842106, min_samples: 5, silhouette score: 0.10128874193019895
eps: 0.6157894736842106, min_samples: 10, silhouette score: 0.1601829469663608
eps: 0.6157894736842106, min_samples: 15, silhouette score: 0.1198293670185905
eps: 0.6157894736842106, min_samples: 20, silhouette score: 0.07350977071061665
eps: 0.8736842105263158, min_samples: 2, silhouette score: 0.08928320061893627
eps: 0.8736842105263158, min_samples: 3, silhouette score: 0.09469074961458734
eps: 0.8736842105263158, min_samples: 4, silhouette score: 0.10367563469295368
eps: 0.8736842105263158, min_samples: 5, silhouette score: 0.10367563469295368
eps: 0.8736842105263158, min_samples: 10, silhouette score: 0.08493103368559306
eps: 0.8736842105263158, min_samples: 15, silhouette score: 0.12582354153312378
eps: 0.8736842105263158, min_samples: 20, silhouette score: 0.15589274057400293
eps: 1.1315789473684212, min_samples: 2, silhouette score: 0.1122364288839039
eps: 1.1315789473684212, min_samples: 3, silhouette score: 0.12124520095358417
eps: 1.1315789473684212, min_samples: 4, silhouette score: 0.1456398424482757
eps: 1.1315789473684212, min_samples: 5, silhouette score: 0.1456398424482757
eps: 1.1315789473684212, min_samples: 10, silhouette score: 0.1422729915068303
eps: 1.1315789473684212, min_samples: 15, silhouette score: 0.13367185710740154
eps: 1.1315789473684212, min_samples: 20, silhouette score: 0.1259998614145595
eps: 1.3894736842105266, min_samples: 2, silhouette score: 0.212723241535456
eps: 1.3894736842105266, min_samples: 3, silhouette score: 0.2412526938506453
eps: 1.3894736842105266, min_samples: 4, silhouette score: 0.2412526938506453
eps: 1.3894736842105266, min_samples: 5, silhouette score: 0.24068433216959192
eps: 1.3894736842105266, min_samples: 10, silhouette score: 0.23995780132929045
eps: 1.3894736842105266, min_samples: 15, silhouette score: 0.23957477013912357
eps: 1.3894736842105266, min_samples: 20, silhouette score: 0.2328067638417386
eps: 1.6473684210526318, min_samples: 2, silhouette score: 0.2323989932128011
eps: 1.6473684210526318, min_samples: 3, silhouette score: 0.24155740743355267
eps: 1.6473684210526318, min_samples: 4, silhouette score: 0.24155740743355267
eps: 1.6473684210526318, min_samples: 5, silhouette score: 0.24155740743355267
eps: 1.6473684210526318, min_samples: 10, silhouette score: 0.24158442699821794
eps: 1.6473684210526318, min_samples: 15, silhouette score: 0.24160332416774805
eps: 1.6473684210526318, min_samples: 20, silhouette score: 0.24160332416774805
eps: 1.905263157894737, min_samples: 2, silhouette score: 0.2323989932128011
eps: 1.905263157894737, min_samples: 3, silhouette score: 0.24155740743355267
eps: 1.905263157894737, min_samples: 4, silhouette score: 0.24155740743355267
eps: 1.905263157894737, min_samples: 5, silhouette score: 0.24155740743355267
eps: 1.905263157894737, min_samples: 10, silhouette score: 0.24155740743355267
eps: 1.905263157894737, min_samples: 15, silhouette score: 0.24155740743355267
eps: 1.905263157894737, min_samples: 20, silhouette score: 0.24142363970547726
eps: 2.1631578947368424, min_samples: 2, silhouette score: 0.3203862077714656
eps: 2.1631578947368424, min_samples: 3, silhouette score: 0.40996695559039703
eps: 2.1631578947368424, min_samples: 4, silhouette score: 0.40996695559039703
eps: 2.1631578947368424, min_samples: 5, silhouette score: 0.40996695559039703
eps: 2.1631578947368424, min_samples: 10, silhouette score: 0.40996695559039703
eps: 2.1631578947368424, min_samples: 15, silhouette score: 0.40996695559039703
eps: 2.1631578947368424, min_samples: 20, silhouette score: 0.40996695559039703
eps: 2.421052631578948, min_samples: 2, silhouette score: 0.3492096468722446
eps: 2.421052631578948, min_samples: 3, silhouette score: 0.4377385817064511
eps: 2.421052631578948, min_samples: 4, silhouette score: 0.4377385817064511
eps: 2.421052631578948, min_samples: 5, silhouette score: 0.4377385817064511
eps: 2.421052631578948, min_samples: 10, silhouette score: 0.4377385817064511
eps: 2.421052631578948, min_samples: 15, silhouette score: 0.4377385817064511
eps: 2.421052631578948, min_samples: 20, silhouette score: 0.4377385817064511
eps: 2.678947368421053, min_samples: 2, silhouette score: 0.37859312886582797
eps: 2.678947368421053, min_samples: 3, silhouette score: 0.4689235915249308
eps: 2.678947368421053, min_samples: 4, silhouette score: 0.4689235915249308
eps: 2.678947368421053, min_samples: 5, silhouette score: 0.4689235915249308
eps: 2.678947368421053, min_samples: 10, silhouette score: 0.4689235915249308
eps: 2.678947368421053, min_samples: 15, silhouette score: 0.4689235915249308
eps: 2.678947368421053, min_samples: 20, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 2, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 3, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 4, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 5, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 10, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 15, silhouette score: 0.4689235915249308
eps: 2.936842105263158, min_samples: 20, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 2, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 3, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 4, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 5, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 10, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 15, silhouette score: 0.4689235915249308
eps: 3.1947368421052635, min_samples: 20, silhouette score: 0.4689235915249308
eps: 3.452631578947369, min_samples: 2, silhouette score: 0.5750308184494759
eps: 3.452631578947369, min_samples: 3, silhouette score: 0.5750308184494759
eps: 3.452631578947369, min_samples: 4, silhouette score: 0.5750308184494759
eps: 3.452631578947369, min_samples: 5, silhouette score: 0.5019438151959774
eps: 3.452631578947369, min_samples: 10, silhouette score: 0.5019438151959774
eps: 3.452631578947369, min_samples: 15, silhouette score: 0.5019438151959774
eps: 3.452631578947369, min_samples: 20, silhouette score: 0.5019438151959774
eps: 3.710526315789474, min_samples: 2, silhouette score: 0.5750308184494759
eps: 3.710526315789474, min_samples: 3, silhouette score: 0.5750308184494759
eps: 3.710526315789474, min_samples: 4, silhouette score: 0.5750308184494759
eps: 3.710526315789474, min_samples: 5, silhouette score: 0.5750308184494759
eps: 3.710526315789474, min_samples: 10, silhouette score: 0.5019438151959774
eps: 3.710526315789474, min_samples: 15, silhouette score: 0.5019438151959774
eps: 3.710526315789474, min_samples: 20, silhouette score: 0.5019438151959774
eps: 3.9684210526315793, min_samples: 2, silhouette score: 0.5750308184494759
eps: 3.9684210526315793, min_samples: 3, silhouette score: 0.5750308184494759
eps: 3.9684210526315793, min_samples: 4, silhouette score: 0.5750308184494759
eps: 3.9684210526315793, min_samples: 5, silhouette score: 0.5750308184494759
eps: 3.9684210526315793, min_samples: 10, silhouette score: 0.5750308184494759
eps: 3.9684210526315793, min_samples: 15, silhouette score: 0.5019438151959774
eps: 3.9684210526315793, min_samples: 20, silhouette score: 0.5019438151959774
eps: 4.226315789473684, min_samples: 2, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 3, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 4, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 5, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 10, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 15, silhouette score: 0.5750308184494759
eps: 4.226315789473684, min_samples: 20, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 2, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 3, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 4, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 5, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 10, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 15, silhouette score: 0.5750308184494759
eps: 4.484210526315789, min_samples: 20, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 2, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 3, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 4, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 5, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 10, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 15, silhouette score: 0.5750308184494759
eps: 4.742105263157895, min_samples: 20, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 2, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 3, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 4, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 5, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 10, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 15, silhouette score: 0.5750308184494759
eps: 5.0, min_samples: 20, silhouette score: 0.5750308184494759

Best parameters: eps=3.452631578947369, min_samples=2, silhouette score=0.5750308184494759

Characteristics of each cluster¶

Summary of each cluster:

In [51]:
# Calculate cluster profiles for DBSCAN clustering
cluster_profiles_dbscan = df.groupby(best_labels).agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()

# Plotting cluster profiles for DBSCAN clustering
num_features = len(cluster_profiles_dbscan.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3

fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(cluster_profiles_dbscan.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_dbscan, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (DBSCAN)')

for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

# Characteristics of each cluster (DBSCAN)
# Assuming best_labels has two clusters (0 and 1)
print("Characteristics of Clusters (DBSCAN):")

# Cluster 0
print("\nCluster 0:")
print(cluster_profiles_dbscan.iloc[0])

# Cluster 1
print("\nCluster 1:")
cluster_profiles_dbscan.iloc[1]
[Figure: bar charts of average feature values per DBSCAN cluster]
Characteristics of Clusters (DBSCAN):

Cluster 0:
index                            0.000000
Total_Spending                 621.466573
Total_Num_Purchases             12.769986
Amount_Spent_Per_Purchase       38.120923
Customer_Tenure               4191.392239
Age                             56.314165
Total_Kids                       0.955119
Single                           0.351566
Well_Educated                    1.000000
Income                       52468.148200
NumDealsPurchases                2.318841
NumWebPurchases                  4.150070
NumCatalogPurchases              2.705937
NumStorePurchases                5.913978
NumWebVisitsMonth                5.278635
Response                         0.153343
Complain                         0.009350
Recency                         49.119682
Tot_Family_Mem                   2.344554
Name: 0, dtype: float64

Cluster 1:
Out[51]:
index                            1.000000
Total_Spending                  81.796296
Total_Num_Purchases              5.222222
Amount_Spent_Per_Purchase       12.854030
Customer_Tenure               4267.203704
Age                             47.537037
Total_Kids                       0.722222
Single                           0.370370
Well_Educated                    0.000000
Income                       20306.259259
NumDealsPurchases                1.796296
NumWebPurchases                  1.888889
NumCatalogPurchases              0.481481
NumStorePurchases                2.851852
NumWebVisitsMonth                6.870370
Response                         0.037037
Complain                         0.000000
Recency                         48.444444
Tot_Family_Mem                   2.092593
Name: 1, dtype: float64

DBSCAN Cluster 0: This cluster exhibits a higher average total spending, total number of purchases, and amount spent per purchase compared to Cluster 1. Members of this cluster also tend to have a higher income and more web purchases. They also show a slightly higher propensity to complain.

DBSCAN Cluster 1: This cluster shows lower average spending, total purchases, and amount spent per purchase. Cluster members in this group tend to have lower incomes and fewer web purchases than those in Cluster 0. They also appear less likely to complain.
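
A quick count of the two DBSCAN segments puts these averages in context, since per-cluster sample size matters for how much weight to give them (a sketch, assuming best_labels and pd are in scope):

# Count customers per DBSCAN cluster (a -1 label would indicate noise)
print(pd.Series(best_labels).value_counts().sort_index())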

Gaussian Mixture Model¶

Visualize the clusters using PCA¶

Cluster Profiling¶

In [52]:
# Find the optimal number of components for GMM using BIC
n_components_range = range(2, 11)
lowest_bic = np.inf
bic = []
for n_components in n_components_range:
    # Fit a Gaussian mixture with n_components components
    gmm = GaussianMixture(n_components=n_components, random_state=42)
    gmm.fit(scaled_segmentation_df)
    bic.append(gmm.bic(scaled_segmentation_df))
    if bic[-1] < lowest_bic:
        lowest_bic = bic[-1]
        best_gmm = gmm

best_n_components = bic.index(lowest_bic) + 2  # Adjusted index for correct number of components

print(f"Optimal number of components: {best_n_components}")


# Fit the GMM model with optimal number of components
gmm = GaussianMixture(n_components=best_n_components, random_state=42)
gmm_labels = gmm.fit_predict(scaled_segmentation_df)


#Now use 'gmm_labels' for further analysis like cluster profiling

# Apply GMM for the best hyperparameter and visualize the clusters from PCA

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_segmentation_df)

# Visualize the clusters
plt.figure(figsize=(10, 8))
for cluster in np.unique(gmm_labels):
    plt.scatter(pca_result[gmm_labels == cluster, 0], pca_result[gmm_labels == cluster, 1], label=f'Cluster {cluster}')

plt.title('GMM Clustering with PCA Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

# Calculate cluster profiles for GMM clustering
cluster_profiles_gmm = df.groupby(gmm_labels).agg({
    'Total_Spending': 'mean',
    'Total_Num_Purchases': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Customer_Tenure': 'mean',
    'Age': 'mean',
    'Total_Kids': 'mean',
    'Single': 'mean',
    'Well_Educated': 'mean',
    'Income': 'mean',
    'NumDealsPurchases': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'Response': 'mean',
    'Complain': 'mean',
    'Recency': 'mean',
    'Tot_Family_Mem': 'mean'
}).reset_index()

# Plotting cluster profiles for GMM clustering
num_features = len(cluster_profiles_gmm.columns) - 1
num_rows = int(np.ceil(num_features / 3))
num_cols = 3

fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, num_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(cluster_profiles_gmm.columns[1:]):
    sns.barplot(x='index', y=feature, data=cluster_profiles_gmm, ax=axes[i])
    axes[i].set_title(f'Average {feature} per Cluster (GMM)')

for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
Optimal number of components: 10
[Figure: GMM clusters plotted in 2-D PCA space]
[Figure: bar charts of average feature values per GMM cluster]

Observations and Insights: The Gaussian Mixture Model (GMM) identified 10 as the optimal number of components based on the Bayesian Information Criterion (BIC). The PCA visualization shows 10 clusters that attempt to follow the aforementioned stripes. Cluster profiles reveal key differences between these segments:

  • The usual features, such as Age, Tenure, and Recency, again show little effect.
  • Total Family Members shows minimal effect, just as in the previous clusterings.
  • The remaining features vary substantially across clusters, with many correlating with one another.
  • DBSCAN still takes the win for best clustering. GMM is next in line for the best spot, since it captures those stripes to some degree.

Further analysis of the cluster profiles should provide actionable insights into the characteristics of these customer segments. Additional feature engineering may be required to capture those stripes.
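
Since BIC drove the choice of 10 components, a quick sketch to inspect the curve itself (assuming the bic list and n_components_range from the GMM cell above are still in scope):

# Plot BIC against component count; the minimum marks the selected model
plt.figure(figsize=(8, 4))
plt.plot(list(n_components_range), bic, marker='o')
plt.xlabel('Number of components')
plt.ylabel('BIC (lower is better)')
plt.title('GMM model selection by BIC')
plt.show()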

Characteristics of each cluster¶

Summary of each cluster: Cluster 0: Likely a moderate spending group, with average purchase frequency and income.

Cluster 1: Potentially a high-value customer segment with above-average spending and purchase frequency. May be more responsive to marketing campaigns and prefer specific channels.

Cluster 2: Could represent a lower spending group, potentially less engaged with the brand. Demographics and purchasing channel preferences could reveal differences in their behavior.

Cluster 3: Likely another moderate-spending group, but with a distinct profile from Cluster 0. A comparison of these two clusters across features would reveal specific differences.

Cluster 4: Potentially a very niche segment with unique characteristics. Further investigation into its demographics, spending patterns, and preferences, supported by more data, would be needed to characterize it fully.

Cluster 5: May be similar to Cluster 2 or 3, but with subtle differences. A side-by-side comparison would help.

Cluster 6: Could represent another high-value segment, potentially distinct from Cluster 1 in specific purchasing behavior or demographics.

Cluster 7: Might represent a segment with a low frequency of purchases but moderate or high spending per transaction. Further investigation would help.

Cluster 8: Potentially a small and unique segment. Deeper analysis could define its characteristics more definitively.

Cluster 9: Another likely moderate-spending group. Comparison with other moderate-spending clusters will distinguish its profile.

Overall the clusters are not as defined as with DBSCAN.

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

2. Refined insights:

  • What are the most meaningful insights from the data relevant to the problem?

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

The notebook explores five clustering techniques: K-Means, K-Medoids, Hierarchical, DBSCAN, and Gaussian Mixture Models (GMM). While the project prompt doesn't explicitly state the primary success metric, visual inspection of cluster separations in PCA-reduced space, coupled with the silhouette score, suggests that DBSCAN performed best. DBSCAN achieved the highest silhouette score (around 0.575) and effectively separated the data into two distinct clusters, as confirmed visually. GMM identified 10 clusters, capturing more granular groupings visually, but its cluster definitions were not as distinct or well-separated as DBSCAN's. K-Means and K-Medoids were less successful at capturing the data's structure, both visually and according to the clustering profiles, and the notebook did not report their numerical results in this section. Performance could be improved further by exploring a wider range of hyperparameters for each algorithm, although the expanded DBSCAN parameter search did not improve results much, and more promisingly by feature engineering to address the visually apparent "stripes" in the PCA projection that the current methods don't fully capture. A small comparison sketch follows.
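
To put the candidates on one numeric footing, a hedged sketch (assuming Z, best_labels, gmm_labels, and scaled_segmentation_df from earlier cells are still in scope; the hierarchical labels are recomputed here because the variable labels was overwritten by the DBSCAN loops):

# Score each method's labels with the same metric
hier_labels = fcluster(Z, t=4, criterion='maxclust')
candidates = {
    'Hierarchical (4 clusters)': hier_labels,
    'DBSCAN (best params)': best_labels,
    'GMM (10 components)': gmm_labels,
}
for name, lbls in candidates.items():
    if len(set(lbls)) > 1:
        print(f"{name}: silhouette = {silhouette_score(scaled_segmentation_df, lbls):.3f}")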

2. Refined insights:

The most meaningful insights revolve around the customer segmentation revealed by the clustering. The two primary clusters identified by DBSCAN show distinct characteristics:

  • High-value customers (DBSCAN Cluster 0): Higher spending, more frequent purchases, higher income, and more web purchases. Slightly higher propensity to complain.
  • Lower-value customers (DBSCAN Cluster 1): Lower spending, fewer purchases, lower income, and fewer web purchases. Less likely to complain.

Similar trends were noted, but with less defined separation, in the GMM clusters. While the project briefly discusses other potential characteristics based on the GMM results, the two main clusters identified by DBSCAN provided the clearest insights. Further analysis of specific marketing campaigns on these two primary clusters would be necessary to validate their distinct response behavior. The visual observation of "stripes" in the data suggests additional, more granular segments may exist.

3. Proposal for the final solution design:

The proposed final solution is to use DBSCAN for customer segmentation due to its superior performance and clear cluster separation. While the GMM model showed some promise by capturing the visually apparent finer-grained groupings, DBSCAN yielded the clearest high-level segmentation, offering more immediately actionable insights into two primary customer groups with distinct spending and purchasing patterns. Improvements can be made via further hyperparameter tuning and, more importantly, through extensive feature engineering to address the visible striping, allowing the model to capture more granular customer subgroups. This should further enhance the value derived from the customer segmentation process.